Today, many consumer services have a significant online presence, which makes it easy for consumers to leave feedback and reviews on the services rendered. Various popular NLP tasks are relevant to review comprehension, including aspect extraction (AE), aspect sentiment classification (ASC), and question answering (QA). Despite the progress made on these fronts, a holistic understanding of a review’s meaning often requires commonsense reasoning by the reader. In recent years, several pre-training techniques (Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019b) have achieved state-of-the-art (SOTA) performance on commonsense reasoning tasks (Zellers et al., 2018; Talmor et al., 2019; Huang et al., 2019; Zhang et al., 2018). However, these solutions are still inadequate in general for review comprehension, as different domains tend to adopt language and commonsense that are domain-specific. For example, the hotel review “The place is 800m away from the beach!” conveys positive information about its walking distance, convenience, and location and can be used to answer questions such as whether the hotel is close to the beach, whether it is within walking distance, or whether it is in a desirable location. It is difficult to answer these questions without domain-specific commonsense.
Table 1 shows more examples of the type of commonsense that would be useful for accurately interpreting reviews in different domains. As our experiments show, these types of commonsense cannot be derived from popular commonsense knowledge bases such as ConceptNet (Liu and Singh, 2004), which also yields sub-optimal results for review comprehension tasks when compared to using our collected domain-specific commonsense.
Table 1: Examples of domain-specific commonsense.
More specifically, our contributions are:
• We developed XSENSE, a system that leverages domain-specific commonsense knowledge bases (KBs) to enhance BERT (Devlin et al., 2019) for various review-reading comprehension tasks such as aspect extraction, aspect sentiment classification, and question answering.
• We present a method to collect and organize domain-specific commonsense KBs at relatively low cost. Less than $700 was spent for each domain to collect KBs with roughly 6,000 commonsense facts.
• We show that XSENSE consistently achieves competitive or SOTA performance for multiple review comprehension tasks with relatively small commonsense KBs across 3 different domains. Specifically, we gain 1.5 absolute F1 improvement for the QA task and outperform the SOTA models by up to 2.42 F1 and 3.18 Macro-F1 for the AE and ASC tasks respectively.
• To facilitate future research, we release three domain-specific KBs in the hospitality, restaurant and laptop domains. We also release an adversarial domain-specific question answering benchmark for the hospitality domain.
The rest of the paper is organized as follows. Section 2 provides an overview of XSENSE. Its architecture is described in Section 3 and we discuss our XSENSE KB construction method in Section 4. We demonstrate the advantages of using XSENSE KBs in our pipeline in our experiments in Section 5. Finally, we discuss related work in Section 6 and conclude the paper in Section 7.
XSENSE (Figure 1) takes a question and a review as input and returns a single span as the answer to the question. The architecture of XSENSE has three main components: (1) an opinion extractor, (2) a commonsense reasoning model, and (3) a review comprehension model. The same architecture can be used for other reading comprehension tasks such as aspect extraction and aspect sentiment classification. The input and output for these tasks differ, and minor adjustments are required, as we explain in Section 3.
The opinion extractor is responsible for extracting spans of the input reviews that convey the reviewers’ opinions, such as “tasty sushi” and “short battery life”. It forwards these spans to the commonsense reasoning model, which determines what each extracted opinion entails. In our implementation, we use the opinion extractor from OpineDB (Li et al., 2019), which is the state-of-the-art tool for opinion mining from reviews.
Figure 1: Overview of XSENSE’s architecture.
The pre-trained commonsense reasoning model identifies what conclusions can be derived from the extracted opinions. For instance, “tasty sushi” might imply a “good Japanese restaurant”, and “short battery life” implies “poor quality”. We refer to the input extraction as a premise and the output of the commonsense reasoning model as a conclusion. The commonsense reasoning model has been trained to identify correct conclusions in a pre-training phase using available XSENSE KBs. In addition to a conclusion, the commonsense reasoning model also outputs an embedding for each premise. These embeddings encode the knowledge that the model has for input premises and can be used to enhance the performance of reading comprehension tasks.
Finally, the review comprehension model uses BERT to compute a representation for the input text. This representation is then augmented with the premise embeddings from the commonsense reasoning model to further enhance the output of the review comprehension model which, in this case, corresponds to identifying the answer span.
In short, the XSENSE pipeline is effective for (1) identifying the parts of texts that are good candidates for commonsense reasoning, (2) predicting what each extracted span from the review entails and encoding this knowledge in an embedding vector, and (3) using these embedding vectors along with BERT to produce better results.
In this section, we detail each component in XSENSE with the assumption that an XSENSE KB is available for the desired domain.
3.1 Opinion Extractor
The opinion extractor takes a review as input and outputs opinion tuples in the schema of (modifier, aspect). For example, given a review “The bathroom is very clean but the food is average.”, the extractor would extract {(very clean, bathroom), (average, food)}. The extraction pipeline in (Li et al., 2019) leverages two models: a sequence tagging model to identify the aspect and modifier spans and a sequence pair classifier to combine aspects with their corresponding modifiers.
3.2 Commonsense Reasoning Model
The goal of the commonsense reasoning model is to predict what conclusions can be derived from the input premise given as a (modifier, aspect) pair. This is done by creating an embedding for each input premise to encode the possible conclusions the input entails. Note that since conclusions are derived from these embeddings, premises with similar conclusions tend to have similar embeddings.
To obtain these premise embeddings, our reasoning model follows a standard sequence-to-sequence model, with a 50-dimensional embedding layer and a 768-dimensional hidden layer of a gated recurrent unit (GRU) (Cho et al., 2014) for both the encoder and the decoder. The embedding layer is initialized with GloVe word embeddings (Pennington et al., 2014). Given an XSENSE KB, which follows the schema shown in Table 1, we train the model with each premise-conclusion pair as a pair of input-output sequences.
Note that there are many techniques for knowledge-base embedding (Yang et al., 2015a; Nickel et al., 2012; Trouillon et al., 2016) which in theory could have been used in XSENSE to embed the commonsense knowledge. However, these techniques only compute embeddings for entities that are present in the knowledge-base and cannot generalize beyond those entities. In our case, entities in the knowledge base are opinions expressed in natural language form, and thus by using a sequence-to-sequence model, we can generalize beyond what appears in our knowledge-base. For instance, even if the phrase “fresh nigiri” does not appear in the XSENSE KB for the restaurant domain, our approach infers that this premise implies “good Japanese restaurant” because the phrase “fresh sashimi” has the same conclusion, and there is a high degree of similarity between the two premises according to word embeddings.
Figure 2: Overview of BERT representations
3.3 Review Comprehension Model
The review comprehension model extends BERT to incorporate the embeddings obtained from the commonsense reasoning model. In what follows, we overview BERT’s architecture for each review comprehension task and explain how the embeddings are utilized by XSENSE.
Overview. Figure 2 shows how BERT processes the input text to produce a representation for each token (shown in blue) as well as for the entire provided text. Note that for each token in the input text, BERT outputs a representation (of size 768 or 1024). Besides the tokens present in the text, BERT uses two special tokens: CLS, which is used to encode a representation for the entire input text, and SEP, which is used to signal BERT about specific aspects of the task at hand. For instance, for question answering tasks the SEP token is used to separate the tokens of the input question from the tokens of the input text.
To use BERT for different NLP tasks, a final layer is added on top of the learned representations. For instance, for sentiment analysis, a single dense layer is added on top of the CLS token which predicts the sentiment of the input text.
XSENSE follows the same approach, but augments the BERT representations with embeddings obtained from the commonsense reasoning model. More specifically, the representation of each token from a sentence s is appended with the embedding of an opinion extracted from sentence s. If the opinion extractor has mined no opinions from sentence s, a vector of all zeros is appended instead. If multiple opinions are extracted from a sentence, we simply pick the first extraction and append its embedding.
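The augmentation step described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the function name, the argument layout, and the use of plain arrays in place of real BERT outputs are hypothetical and not taken from the XSENSE implementation.

```python
import numpy as np

def augment_token_representations(token_reprs, sentence_ids, opinion_embeddings, kb_dim):
    """Append one commonsense embedding to every token representation.

    token_reprs: array of shape (num_tokens, hidden_dim), standing in for BERT outputs.
    sentence_ids: for each token, the index of the sentence it belongs to.
    opinion_embeddings: dict mapping a sentence index to the list of premise
        embeddings (each of length kb_dim) extracted from that sentence.
    """
    augmented = []
    for token_repr, sent_id in zip(token_reprs, sentence_ids):
        opinions = opinion_embeddings.get(sent_id, [])
        if opinions:
            # Several opinions in one sentence: keep only the first extraction.
            kb_vector = np.asarray(opinions[0], dtype=token_reprs.dtype)
        else:
            # No opinion mined from this sentence: append an all-zero vector.
            kb_vector = np.zeros(kb_dim, dtype=token_reprs.dtype)
        augmented.append(np.concatenate([token_repr, kb_vector]))
    return np.stack(augmented)
```

Each output row has hidden_dim + kb_dim entries, so any dense layer placed on top must be sized for the concatenated vector.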
Aspect Extraction. This task identifies the tokens in a given review that are aspects of the item or the service being reviewed. For instance, “food” in the review “The food was tasty, but ...” is an aspect to be extracted. The input to this task is the CLS token followed by the tokens of the review. To predict which tokens should be extracted, a single dense layer is added on top of the BERT representations which are augmented by adding the commonsense embeddings to these representations. The dense layer outputs the probability of whether or not each token is part of an aspect span.
Aspect Sentiment Classification. The input to the aspect sentiment classifier is a review along with a span marked in the review as the targeted aspect. The goal is to predict whether the reviewer’s opinion on the aspect is positive, negative, or neutral. The input is provided to BERT in the same manner as the aspect-extraction task with one minor adjustment: the targeted aspect is appended to the original review after a SEP token. To predict the expressed sentiment, a dense layer is often added on top of the CLS token which is fine-tuned during the training process. However, XSENSE augments the CLS representation by adding the commonsense embedding of the input text. A dense layer is added on top of this augmented representation to make the final prediction.
Question Answering. Given a question and a review (which is assumed to contain the answer to the question), the goal is to find the span in the review that can serve as the answer to the question. This input is fed to BERT by separating the question and the review using a SEP token.
To identify which span has the highest likelihood of being the correct answer, two single dense-layer classifiers are added on top of the BERT representations of each token appended with their associated commonsense embeddings. The two classifiers compute the likelihood of each token being at the start and at the end of the answer span respectively. Based on these probabilities the span with the highest likelihood of being the answer is extracted.
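As a rough illustration of the final extraction step, the sketch below picks the span that maximizes the product of a start probability and an end probability. Scoring by the product, requiring the end to be at or after the start, and the max_answer_len cap are assumptions made for this example; the text above does not specify these details.

```python
import numpy as np

def best_span(start_probs, end_probs, max_answer_len=30):
    """Return (start, end) token indices of the highest-scoring answer span.

    start_probs[i]: probability that token i starts the answer.
    end_probs[j]: probability that token j ends the answer.
    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = -np.inf
    n = len(start_probs)
    for start in range(n):
        for end in range(start, min(n, start + max_answer_len)):
            score = start_probs[start] * end_probs[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best
```

For example, with start probabilities [0.1, 0.7, 0.2] and end probabilities [0.6, 0.1, 0.3], the sketch returns (1, 2): the pairing of start 1 with end 0 would score higher but is not a valid span.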
Here we present our technique for creating an XSENSE KB from a corpus of reviews. Our goal is to understand what conclusions a certain expressed opinion entails. For instance, “fresh sashimi” often implies a “good Japanese place”, but building such knowledge bases is not trivial for several reasons. First, these relationships are rarely mentioned explicitly in reviews. Moreover, such relationships, while generally true, are not completely factual as there could also be a “low quality Japanese restaurant” that serves “fresh sashimi”. Despite these challenges, we show how the unique structure of review corpora enables us to mine these relationships effectively.
Figure 3: Representations used for KB construction
We start by applying the opinion extractor to obtain all (modifier, aspect) pairs from the reviews. We then create two representations of the data:
Extraction Matrix: We create a matrix M where each row i corresponds to a product or service i being reviewed and each column j corresponds to a unique (modifier, aspect) pair extracted by the opinion extractor. Each entry denotes the number of times that the (modifier, aspect) pair j has been observed in reviews of item i.
Modifier-Aspect Tensor: In a similar manner, we create a tensor T with three dimensions corresponding to the items, the modifiers, and the aspects extracted from the reviews. Each entry denotes the number of times that modifier j on aspect k has been observed in reviews of item i.
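The two count structures can be built in a few lines of NumPy. This is a minimal sketch, assuming the extractor output is available as (item, modifier, aspect) triples with one triple per mention; the function name and the alphabetical index ordering are choices made only for this example.

```python
import numpy as np

def build_representations(extractions):
    """Build the extraction matrix M and the modifier-aspect tensor T.

    extractions: iterable of (item, modifier, aspect) triples, one per mention.
    Returns M, T and the index lists (items, modifiers, aspects, pairs).
    """
    extractions = list(extractions)
    items = sorted({item for item, _, _ in extractions})
    modifiers = sorted({mod for _, mod, _ in extractions})
    aspects = sorted({asp for _, _, asp in extractions})
    pairs = sorted({(mod, asp) for _, mod, asp in extractions})

    item_idx = {v: i for i, v in enumerate(items)}
    mod_idx = {v: i for i, v in enumerate(modifiers)}
    asp_idx = {v: i for i, v in enumerate(aspects)}
    pair_idx = {v: i for i, v in enumerate(pairs)}

    # M[i, j]: how often pair j was observed in reviews of item i.
    M = np.zeros((len(items), len(pairs)), dtype=int)
    # T[i, j, k]: how often modifier j was observed on aspect k for item i.
    T = np.zeros((len(items), len(modifiers), len(aspects)), dtype=int)
    for item, mod, asp in extractions:
        M[item_idx[item], pair_idx[(mod, asp)]] += 1
        T[item_idx[item], mod_idx[mod], asp_idx[asp]] += 1
    return M, T, items, modifiers, aspects, pairs
```

Both structures hold the same counts; the tensor simply keeps modifiers and aspects on separate axes, which is what lets a factorization learn a vector for each of them individually.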
Figure 3 illustrates how the reviews and all extracted (modifier, aspect) pairs are organized. Using these data representations, we compute a dense representation for each modifier-aspect pair using tensor factorization techniques as follows. To decompose matrix M, we represent each item i and each (modifier, aspect) pair j with d-dimensional vectors $u_i$ and $v_j$ such that their inner product, denoted as $\langle u_i, v_j \rangle$, would be a good approximation of $M_{ij}$. More specifically, we compute these vectors such that $\sum_{i,j} \left( M_{ij} - \langle u_i, v_j \rangle \right)^2$ is minimized. To decompose the modifier-aspect tensor T we follow a similar approach and assign d-dimensional vectors $a_i$, $b_j$, and $c_k$ to each item i, modifier j, and aspect k such that the sum of the entries of their Hadamard product, denoted as $\sum_{l=1}^{d} (a_i \odot b_j \odot c_k)_l$, would be a good approximation of $T_{ijk}$. As before, we compute these representation vectors such that $\sum_{i,j,k} \left( T_{ijk} - \sum_{l=1}^{d} (a_i \odot b_j \odot c_k)_l \right)^2$ is minimized. These vectors are computed using a PARAFAC factorization technique (Harshman and Lundy, 1994) and we use the implementation provided by Tensorly.
Figure 4: An example of the candidate verification task.
Note that decomposing the modifier-aspect tensor produces representations for each modifier and aspect separately. To obtain a representation for the pair consisting of modifier j and aspect k, we use their Hadamard product (i.e., $b_j \odot c_k$). Once dense representations for all (modifier, aspect) pairs are computed, we create a commonsense KB through the following two steps:
Candidate Generation: In this step, we create a set of candidate premise-conclusion pairs. More specifically, for each (modifier, aspect) pair p, we find the 3 other (modifier, aspect) pairs whose representations have the highest cosine similarity with that of p. To ensure that candidate premises and conclusions are different enough, we also find the most similar pair whose modifier and aspect are both distinct from those of p. Note that pairs with similar representations are pairs that appear with similar distributions across all items, and thus are quite likely to be related. The candidates mined in this step are then forwarded to human annotators for verification.
Verification: In this step, the annotators receive a pair of extractions and are asked to identify if the pair is unrelated, equivalent, or if one implies the other. Figure 4 shows an instance of our verification task and how it was shown to human annotators.
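The candidate generation step can be sketched as a nearest-neighbour search over the pair representations. The code below is an illustrative sketch, not the authors' implementation: the function names, the stable tie-breaking, and the way the extra "distinct" candidate is merged with the top-k list are assumptions.

```python
import numpy as np

def hadamard_pair_vector(modifier_vec, aspect_vec):
    """Pair representation from separate modifier and aspect vectors."""
    return np.asarray(modifier_vec, dtype=float) * np.asarray(aspect_vec, dtype=float)

def generate_candidates(pairs, vectors, top_k=3):
    """For each (modifier, aspect) pair, propose related pairs by cosine similarity.

    pairs: list of (modifier, aspect) tuples.
    vectors: array of shape (num_pairs, d), one dense representation per pair.
    Returns a dict mapping each pair to its list of candidate pairs.
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    sims = unit @ unit.T

    candidates = {}
    for p, (mod, asp) in enumerate(pairs):
        # Most similar first; the pair itself is excluded.
        order = [int(q) for q in np.argsort(-sims[p], kind="stable") if q != p]
        chosen = order[:top_k]
        # Extra candidate: most similar pair sharing neither modifier nor aspect.
        for q in order:
            if pairs[q][0] != mod and pairs[q][1] != asp:
                if q not in chosen:
                    chosen.append(q)
                break
        candidates[(mod, asp)] = [pairs[q] for q in chosen]
    return candidates
```

The candidates produced this way are only proposals; in the pipeline described above they still go to human annotators, who decide whether a pair is unrelated, equivalent, or an implication.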
Note that we can use either the modifier-aspect tensor or the extraction matrix for creating the XSENSE KB. However, we use both data structures in conjunction as we observed that the extraction matrix yields better results for frequent (modifier, aspect) pairs, and the modifier-aspect tensor produces better results for (modifier, aspect) pairs in the long tail. Thus by combining the results from both structures we achieve a good set of candidates across the board.
In this section, we present our evaluation setting, introduce the datasets used, including our new adversarial QA dataset for the hospitality domain, and discuss the performance of our system XSENSE compared to a number of baselines. Our experiments demonstrate two key results: (1) XSENSE KBs contain commonsense information that cannot be derived from ConceptNet, the most popular commonsense KB, and (2) XSENSE KBs improve review comprehension: XSENSE, which utilizes XSENSE KBs, outperforms the state-of-the-art models on multiple review comprehension tasks. To facilitate future research, we are making all three constructed XSENSE KBs and our adversarial QA dataset publicly available online.
5.1 Constructed XSENSE KBs
We have created three XSENSE KBs for improving review comprehension. Table 2 shows the overall statistics of the collected KBs. The first two rows denote the corpora and the specific subset of the data that were used for creating each XSENSE KB. Once all modifier-aspect pairs were extracted from the reviews, we picked a subset of the most reviewed entities as well as a subset of the most frequent extractions to form the extraction matrix and the modifier-aspect tensor as described in Section 4. The numbers of selected entities and extractions are listed in the third row. The next two rows show the final number of opinions in the knowledge-base and the number of relationships discovered between them, respectively. The last two rows in the table demonstrate to what extent the contents of our constructed knowledge-base can be obtained from ConceptNet. The extraction overlap is the percentage of extracted opinions that can be directly found in ConceptNet. For instance, while “thin walls” appears in ConceptNet, most extracted opinions such as “noisy room” are missing. The relation overlap denotes to what extent the facts in our XSENSE KBs can be derived indirectly from ConceptNet. Of course, since “noisy room” is absent from ConceptNet, we cannot derive its relationship to “thin walls” directly. Instead, we look to see if there is a relation in ConceptNet between the modifiers of each premise and conclusion as well as their aspects. In this case, while there is an edge between “walls” and “rooms”, there is still no relation connecting “noisy” to “wall” in ConceptNet.
Table 2: RestaurantSense, LaptopSense, HospitalitySense.
5.2 Commonsense KB evaluation datasets
To measure the value of using commonsense for review comprehension, we evaluate XSENSE on two public aspect-based sentiment analysis (ABSA) datasets, each of which consists of an aspect extraction (AE) task and an aspect sentiment classification (ASC) task. Moreover, we create a QA dataset for the hospitality domain which is more challenging than existing QA datasets for reviews (as shown by the low F1 scores achieved by the state-of-the-art systems). This is because existing QA datasets for reviews are often constructed by matching reviews with questions using IR techniques and consequently, questions and answer spans tend to exhibit a large similarity.
We discuss next how our collected dataset avoids this bias, and then describe briefly the public datasets that are used in our experiments.
HotelQA dataset. We created an adversarial QA dataset (HotelQA) of 757 data entries where each question requires commonsense reasoning in the hospitality domain to answer. Similar to SQuAD v1.1 (Rajpurkar et al., 2016), each data entry of HotelQA is a tuple (review, question, answer) where answer is a sentence span within review. On average, each review has 138.6 words, each question has 5.8 words, and each answer has 19.6 words. The dataset is more challenging for two reasons. First, questions regarding a specific topic (e.g., parking) are paired with reviews that mention the same topic at least three times, which makes the dataset adversarial towards machine learning models. Second, the concrete sub-topic (e.g., parking fee) is not mentioned explicitly in the review, so answering correctly requires commonsense reasoning. An example QA tuple is (Review: “...The best was the pre-paid parking. I booked on Expedia and included parking. A great deal! Parking was just behind the hotel and connected ...”, Question: “Do you have parking nearby?”, Answer: “Parking was just behind the hotel and connected.”) The HotelQA dataset is separated into a training set of 681 QA pairs (90%) and a validation set of 76 QA pairs (10%).
ABSA datasets. We evaluate XSENSE on four ABSA datasets. The datasets cover two domains (laptops and restaurants) and consist of two tasks, AE and ASC. All four datasets are from SemEval competitions (Pontiki et al., 2014, 2016). Table 3 summarizes the statistics of these datasets. We split the datasets into training/validation sets following the settings of (Xu et al., 2019), where 150 training examples are held for validation.
Table 3: Statistics for the ABSA datasets. S: number of sentences; A: number of aspects; P, N, and Ne: number of positive, negative and neutral polarities.
5.3 Experimental setup
Next, we describe our experimental setup for each task and describe the baseline methods. To demonstrate the importance of building domain-specific commonsense knowledge-bases, we compare our results with an adapted version of XSENSE that uses embeddings from ConceptNet, which we refer to as XSENSE(CN). We start with a description of this baseline (which we use in all experiments), and then continue introducing our task-specific baselines and experimental setup.
XSENSE(CN). XSENSE(CN) operates in the same manner as XSENSE. The only important change is that the commonsense embeddings do not come from our commonsense reasoning model. Instead, we obtain the embeddings by applying a KB embedding technique to ConceptNet. More specifically, we took the English subset of ConceptNet with 873K entities and 1.5M relations, and then embedded them using the DISTMULT (Yang et al., 2015b) technique for KB embedding. We use OpenKE and its default configuration to train the model.
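For readers unfamiliar with DistMult, its scoring function is a trilinear product of the head, relation, and tail vectors. The NumPy sketch below shows only that general formulation; it is not the OpenKE API and makes no claim about the configuration used in the experiments, and the function name is hypothetical.

```python
import numpy as np

def distmult_score(head, relation, tail):
    """DistMult plausibility score of a (head, relation, tail) triple.

    The relation acts as a diagonal matrix, so the score reduces to the sum of
    the element-wise product of the three d-dimensional vectors.
    """
    head, relation, tail = (np.asarray(v, dtype=float) for v in (head, relation, tail))
    return float(np.sum(head * relation * tail))
```

Because the product is symmetric in head and tail, swapping the two entities leaves the score unchanged. Note also that a model of this kind learns one vector per entity seen in training, which is why phrases absent from the KB have no embedding.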
HotelQA setup and baselines. For this task, XSENSE is implemented on top of a BERT QA model fine-tuned on SQuAD v1.1. This QA model uses a pre-trained BERT-large model (Wolf et al., 2019) of 24 layers with 340M parameters. XSENSE extends the original 1,024-dimensional representation with a 768-dimensional KB vector. We compare XSENSE to a baseline BERT+SQuAD, which is a BERT model using the same configuration as XSENSE. The other baseline is XSENSE(CN) as described above. All models are trained on the training set for 10 epochs with a learning rate of 3e-6 and evaluated on the validation set. We train each model 5 times and report the average and the standard deviation of the best F1 and exact-matching scores of each run.
ABSA setup and baselines. We compare XSENSE with BERT-PT (Xu et al., 2019), the SOTA method for AE and ASC. BERT-PT improves the vanilla BERT model by concurrently fine-tuning the 12-layer BERT-base model on an in-domain corpus and on a reading comprehension dataset. We reproduce the results of BERT-PT by fine-tuning the same BERT model on in-domain corpora: 1.17 million sentences (He and McAuley, 2016) for the laptop domain and 2 million sentences (Yelp) for the restaurant domain. We use the resulting models (denoted as in-domain BERT) as a baseline, which already performs similarly to or even better than BERT-PT.
We also use XSENSE and XSENSE(CN) to incorporate KB embeddings. Note that since AE and ASC are part of the opinion extraction pipeline, we avoid the interference of an overly powerful opinion extractor by assuming a much weaker one: it simply takes all aspect/modifier tokens that appear in the XSENSE KB as the opinion.
For all models trained on the ABSA datasets, we fine-tune BERT for 20 epochs with a learning rate of 5e-5. We select the model with the best performance (F1 for AE and MF1 for ASC) on the validation set and report the performance on the test set. We repeat each experiment 5 times and report the average.
Table 4: Results on HotelQA with standard deviation.
5.4 HotelQA and ABSA results
HotelQA results. Table 4 shows the results of comparing XSENSE with the two baselines. We report (1) the token-wise F1 scores, which measure the overlap between the predictions and the gold answer, and (2) the exact-matching scores, i.e., the percentage of predictions that match the gold answer exactly. In Table 4, XSENSE improves the base QA model by 2.5% and outperforms XSENSE(CN), which uses ConceptNet, by 1.3%. We inspected the output of each model, and Table 5 shows an example QA pair where XSENSE outperforms the baseline models.
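The two QA metrics can be made concrete with a short sketch. It assumes lowercasing and whitespace tokenization only; the official SQuAD evaluation script applies further normalization (for example, stripping punctuation and articles), so scores from this simplified version may differ slightly.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-wise F1 between a predicted span and the gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers match; an empty answer never matches a non-empty one.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(prediction, gold):
    """1.0 if the two spans are identical after normalization, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())
```

A prediction that covers five of the eight gold tokens and nothing else has precision 1 and recall 5/8, giving an F1 of 10/13 (about 0.77) but an exact-match score of 0.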
ABSA results. We summarize the results on the four ABSA datasets in Table 6. We measure the model performance on AE tasks using F1 and the model performance on ASC tasks using both accuracy and macro-F1. XSENSE consistently outperforms BERT-PT (SOTA) on all datasets. The improvements range from 0.18 (F1 for Laptop AE) to 3.18 (MF1 for Restaurant ASC). The improvement is higher for the restaurant domain. Intuitively, this is because the restaurant KB is of better quality. Moreover, we notice that ConceptNet hurts the BERT performance in most cases, while XSENSE improves the baseline model both more substantially and more consistently. These results indicate that the domain-specific knowledge captured by the XSENSE KB is beneficial to ABSA tasks.
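Macro-F1, the ASC metric used above, averages the per-class F1 scores so that each polarity counts equally regardless of how frequent it is. The sketch below is a generic illustration; the class names and the convention of assigning F1 = 0 to a class with no gold or predicted instances are assumptions of this example, not details taken from the evaluation scripts.

```python
def macro_f1(gold_labels, predicted_labels, classes=("positive", "negative", "neutral")):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for c in classes:
        pairs = list(zip(gold_labels, predicted_labels))
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because rare classes weigh as much as frequent ones, a model that neglects the neutral class can have high accuracy yet a noticeably lower macro-F1.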
Commonsense Reasoning Tasks. Several commonsense reasoning tasks have been proposed: SWAG (Zellers et al., 2018), CommonsenseQA (Talmor et al., 2019), and Cosmos QA (Huang et al., 2019) are multiple-choice QA tasks, and ReCoRD (Zhang et al., 2018) is a cloze-style QA task. Those datasets were carefully curated to exclude easy questions that text processing systems can answer by exploiting lexical heuristics. To the best of our knowledge, HotelQA is the first span-extraction QA dataset for commonsense reasoning that has been published to date.
Review: ... The size of the bathroom is the only downside I’ve found. Very small! However, plenty of hot water (showerhead was not working very effectively...) and very clean bathroom and towels. The morning receptionist (I forgot to ask him his name, but I thank him again) was very nice and accepted to keep our car in the hotel parking until 3:00 p.m. at no charge. This allowed us to go shopping on St. Denis Street without being forced to find a pay parking or to run to a parking meter every 2 hours...
Table 5: An example QA pair where XSENSE outperforms baselines. It is likely that the baseline models picked the “However ...” span because it contains “plenty of”, which also appears in the question. XSENSE avoids this span, perhaps because its “bathroom” concept was strengthened by the commonsense vector of (“very clean”, “bathroom”).
Table 6: Aspect Extraction (AE, left) and Aspect Sentiment Classification (ASC, right) results. The BERT-PT numbers are taken from (Xu et al., 2019). P: precision, R: recall, MF1: Macro-F1. The standard deviation is 1.51 for AE and 1.06 for ASC.
External Knowledge Integration. A popular approach to integrating KBs into NN models is to integrate embeddings obtained from the KB into the model. KB-LSTM (Yang and Mitchell, 2017) incorporates external knowledge by adding knowledge embeddings obtained from WordNet into RNN-LSTM. Yang et al. (2019a) applied a similar idea to BERT. Mihaylov and Frank (2018) used an attention mechanism to integrate relevant external knowledge for cloze-style reading comprehension. Lin et al. (2019) developed a method that uses schema graph construction for KB embedding. All of these techniques require a KB-retrieval function to find corresponding information from the KBs.
Other approaches use an auxiliary model or data as additional evidence to the main model. Emami et al. (2018) collected texts from the web using a query augmented from the input text to improve the performance on Winograd Schema Challenge (WSC). Rajani et al. (2019) created a dataset of explanations and developed a framework that uses a language model trained to generate an explanation as an auxiliary input to the main QA model.
Our framework is different from the two approaches above. It uses a general-purpose opinion extractor and a seq2seq model that can take any input, including ones that do not explicitly appear in the KB. In contrast to the second approach, our framework directly integrates the auxiliary information into the model, and does not provide it as part of the input text.
Automated KB Construction. Our automatic KB construction approach is closely related to Universal Schema (Yao et al., 2013; Verga et al., 2015), which is a matrix factorization technique for relation extraction. Other matrix factorization techniques for KB construction include (Nickel et al., 2011) and (He et al., 2015). A major difference, however, is that we also use tensor factorization to model aspects and modifiers separately.
Review Comprehension. Xu et al. (2019) introduced the Review Reading Comprehension task, and created a new reading comprehension dataset based on crowd-sourced questions on reviews in the ABSA datasets (Pontiki et al., 2015). We demonstrate that XSENSE outperforms their approach and achieves the SOTA results (see Table 6).
We establish that domain-specific commonsense can noticeably improve multiple review comprehension tasks in ways that conventional commonsense knowledge bases cannot. We develop XSENSE, a system that can exploit relatively small domain-specific knowledge bases on top of transformer-based language models for review comprehension, and establish its effectiveness through an extensive set of experiments. To facilitate further research, we also publicly release three domain-specific knowledge bases, in the domains of hospitality, restaurants, and laptops, and release a question-answering benchmark for the hospitality domain.
Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ali Emami, Noelia De La Cruz, Adam Trischler, Kaheer Suleman, and Jackie Chi Kit Cheung. 2018. A knowledge hunting framework for common sense reasoning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1949–1958, Brussels, Belgium. Association for Computational Linguistics.
Richard A. Harshman and Margaret E. Lundy. 1994. Parafac: Parallel factor analysis. Comput. Stat. Data Anal.
Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW, pages 507–517.
Wenqiang He, Yansong Feng, Lei Zou, and Dongyan Zhao. 2015. Knowledge base completion using matrix factorization. In Asia-Pacific Web Conference, pages 256–267. Springer.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401, Hong Kong, China. Association for Computational Linguistics.
Yuliang Li, Aaron Xixuan Feng, Jinfeng Li, Saran Mumick, Alon Halevy, Vivian Li, and Wang-Chiew Tan. 2019. Subjective databases. PVLDB, 12(11):1330–1343.
Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang Ren. 2019. KagNet: Knowledge-aware graph networks for commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2822–2832, Hong Kong, China. Association for Computational Linguistics.
H. Liu and P. Singh. 2004. Conceptnet — a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Todor Mihaylov and Anette Frank. 2018. Knowledgeable reader: Enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 821–832, Melbourne, Australia. Association for Computational Linguistics.
Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816.
Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2012. Factorizing yago: Scalable machine learning for linked data. In Proceedings of the 21st International Conference on World Wide Web, New York, NY, USA. ACM.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, ALSmadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In SemEval-2016, pages 19–30.
Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. Semeval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495.
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35.
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, ICML’16.
Patrick Verga, David Belanger, Emma Strubell, Benjamin Roth, and Andrew McCallum. 2015. Multilingual relation extraction using compositional universal schema. arXiv preprint arXiv:1511.06396.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface’s transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2019. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In NAACL-HLT 2019.
An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. 2019a. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2346–2357, Florence, Italy. Association for Computational Linguistics.
Bishan Yang and Tom M. Mitchell. 2017. Leveraging knowledge bases in LSTMs for improving machine reading. In ACL 2017, pages 1436–1446.
Bishan Yang, Scott Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015a. Embedding entities and relations for learning and inference in knowledge bases. In Proceedings of the International Conference on Learning Representations (ICLR) 2015.
Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2015b. Embedding entities and relations for learning and inference in knowledge bases. In ICLR ’15.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019b. XLNet: Generalized autoregressive pre-training for language understanding. arXiv preprint arXiv:1906.08237.
Limin Yao, Sebastian Riedel, and Andrew McCallum. 2013. Universal schema for entity type prediction. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 79–84. ACM.
Yelp. The Yelp Dataset Challenge. https://www.yelp.com/dataset.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.