A synonym is a word having the same or nearly the same meaning as another word. Due to the complexity of natural language, synonym happens very frequently and brings much challenge to many applications. For example, a NLP system in clinical domain should understand that “oxyphenbutazone” and “Tandearil” are of the same medicine, otherwise it may generate misleading interpretation or even answer patients’ questions wrongly.
However, it is far from a trivial task to identify synonyms. Many studies in this direction have started out with intuitively appealing ideas, but in practice, these intuition based approaches always meet exceptions. As discussed in (Norlindh, 2012), sometime it is difficult to extract synonym even for professional lexicographers. Moreover, as we will show in this paper, recognizing synonym on large scale datasets is much more difficult than that in small scale.
Table 1: Examples of synonym and non-synonym words
Table 1 shows a few examples of synonym and non-synonym. Extracting such synonyms is important in a number of applications:
• Searching engine: The vast number of users of search engines may use different terms for the same query concept. Thus a crucial step in search engines is query expansion, for which extraction of synonyms plays an important role (Bernhard, 2010). For example, if one query is “Turks-cap”, the search engine should know it is a flower like “Lilium martagon”.
• Clinical text: In health and related applications, we also need match different terms used in books and clinical realities (Henriksson et al., 2014). For example, the symptoms from patients’ own description may appear different from the clinical books, but they may refer to the same disease. Moreover, clinical text may change over time, and it is important to recognize synonyms from historical records.
• Question Answering: Question Answering (QA) aims to automatically answer questions posed by humans in a natural language. Ferrucci et al. (2010) showed it is possible to analyzes natural language questions and content well enough and fast enough to compete and win against champion players at Jeopardy! game. To handle the ambiguous questions, many modules in the QA system (Ferrucci, 2012) must take the synonym into consideration.
The purpose of this research is to develop a synonym extractor which can automatically explore a large corpus and complement existing thesauri or synonym annotations. Intuitively, we can ask lexicographers to think or analyze corpora by themselves to find synonyms candidates. However, manual labeling is both costly and challenging, often resulting in low coverage of synonym in large corpus. Even worse, when we come to a new domain, we may not get enough lexicographers’ annotations in time. A better approach is to let lexicographer annotate a training set of synonyms, from which we can train a detector to find more synonyms1.
There have been a number of research studies on synonym extraction. Wang & Hirst (2009) explored dictionary definitions to extract synonyms. They compared rule based methods and maximum entropy method, and surprisingly found that rule based method outperforms its counterpart by large margin. Henriksson et al. (2014) studied synonym in medical datasets. One shared limitation in these studies lies in the fact that the datasets are very small. The medical synonym datasets used by (Henriksson et al., 2014) is limited 340 synonym pairs. The experiments in Wang & Hirst (2009) are limited to TOFEL synonym test with 80 questions. Collobert & Weston (2008) considered synonym, as a related task to improve the performance of semantic role labeling, but did not report the performance of synonym extraction in large scale.
In this paper, we build a synonym extraction datasets with 3.4M term pairs, which is significantly larger than those in previous works. We test several approaches including linear SVM, multilayer perceptron, and a new verification feature learning based neural network. The results show that our feature learning based neural network outperforms the baseline by relatively 97% in the large scale dataset.
Our synonym dataset is obtained by expanding the synonyms and antonyms from WordNet (Miller, 1995). We first collect synonym and antonym pairs, then exclude some extremely unpopular words. As a result, we get 47K synonym and antonym pairs. The ratio between synonym and antonym is about 10:1. They are used in our experiments as positive and negative examples.
In practice, we need distinguish synonym from a large amount of irrelevant pairs. To model this phenomenon, we introduce a huge amount of unrelated terms as negative pairs. Table 2 demonstrates an example of expanding negative pairs. We choose the ratio of synonym over the whole dataset as 1:100. Finally we get 3.4M term pairs, of which 1% are positive and 99% are negative, including both antonym and irrelevant.
The motivation of building such a highly unbalanced datasets is two-fold:
• During the training stage, such a large dataset provides enough training samples and hence benefits complex models and automatic feature learning.
• During the testing stage, our dataset reflects the setting that synonyms in practice are sparse compared with the large amount of irrelevant pairs. A large testing set simulates better the challenge of synonym extraction in practice, compared with the small training sets used in previous studies.
Table 2: Illustration of expanding non-synonym pair
Evaluation: To evaluate the performance of our synonym extraction, we divide the dataset into two parts: 33% for training, and the other 67% for testing. There are 1.1M training pairs and 2.3M testing pairs, which we believe will cover a wide range of word concepts.
Since our dataset is highly unbalanced, it makes less sense if we use classification accuracy as the evaluation criteria. Instead, we choose F1 score, which is the harmonic mean of precision recall in the testing set. F1 score is widely used in many QA system modules (Ferrucci, 2012).
Our baseline approach is to train a SVM model over the distributed representations of words which can be learned from a corpus in an unsupervised manner. One of the earliest use of word representations was done by Rumelhart et al. (1986). This idea has since been applied to statistical language modeling (Bengio et al., 2003), which motivated a number of research in speech recognition and machine translation, as well as a wide range of NLP tasks, including (Collobert & Weston, 2008) (Glorot et al., 2011) (Socher et al., 2011) (Mnih & Teh, 2012) (Lebret & Collobert, 2014).
The idea of using distributed representation is to employ the co-occurrence of words and phrases to represent semantics (Turney & Pantel, 2010). A number of researchers (Bengio et al., 2003) have employed recurrent neural network to learn a distributed representations of words, which encode many linguistic regularities and patterns in a vector space (Mikolov et al., 2013b). For example, the result of a vector calculation vec(Madrid) - vec(Spain) + vec(France) is closer to vec(Paris) than to any other word vector (Mikolov et al., 2013b).
Although the recurrent neural network can achieve good performance in natural language processing tasks, not all of them are scalable to very large corpus. We prefer an efficient method for arbitrary size of documents, so we employ the skip-gram model proposed by (Mikolov et al., 2013a) which learns word vectors efficiently by a two layer network from the context words in a sentences. The implementation of Mikolov et al. (2013a) is extremely efficient: we can learn vector representations for billions of words from the whole Wikipedia corpus on a server with 5 CPUs within two hours. We choose dimension=100 for the word representation.
The unsupervised learning of distributed representation helps us to develop a synonym classifier for either general purpose or a specified domain. We can first learn the vector representation from a large corpus, and use this presentation as features to learn a synonym extractor. Our baseline is to train a linear SVM with the distributed representation. In practice, we use the liblinear package by Fan et al. (2008). To handle the unbalanced problem of positive pairs, we use a sampling weight of 100 for positive class in the 3.4M dataset.
We train the linear SVM model on two datasets. A small dataset has 47K pairs with synonyms and antonyms only. The extended dataset incorporates a lot of irrelevant pairs as the negative pairs, which results in a dataset with 3.4M pairs. Table 3 shows the performance of our baseline approach for synonym recognition. Both datasets use 33% for training and the remaining 67% for testing. It is easy to see that the word vector + SVM approach works very well for the small dataset, but it is hard to obtain good performance on large datasets.
Table 3: Performance of our baseline approaches
Table 4: Examples of synonym pairs wrongly predicted by our baseline.
Table 4 shows some failure examples using our baseline algorithms. We can see that synonym extraction is a difficult problem by its nature. Some words, like “capital”, “insulation”, and “groin” have multiple meanings; Some words, like “‘VIII” are only used in certain documents. When it comes to large scale, the learning problem become even more difficult due to several factors: the learning problem is highly unbalanced, the difference feature is not good enough to capture diversity in large scale word representation. Although models of distributional semantics provide potentials of supporting development of such resources, the ability of isolating synonymy from other semantic relations in large corpora is limited. In the next section, we will discuss how to overcome these challenges and how to build better synonym extractors.
Figure 1: Demonstration of our cost function.
To explain our approach, let’s first denote a pair of words represented as and
, and the label of this pair is y. We say y = 1 if
and
is synonym, and
for non-synonyms. A synonym extractor will generate an estimation
which is hope to be close to y. Before discussing our model, we would like to introduce our cost function:
where denotes the estimation. Our cost function can be viewed as a modified version of mean square error (MSE), whose cost is simply
. Such cost function prefers estimations with good F1 scores. However, unlike traditional MSE, our new cost does not require
be exactly as +1 or
. Given a training sample with
for y = 1 or
for
, we think the estimation is good enough and will count as zero cost. Many popular cost functions, including MSE, cross entropy and hinge loss both suffer from unbalanced problem, for which we have to assign a hyper parameter as sampling weight for unbalanced categories. However, there is no hyper-parameter in our cost function because a majority of samples in the dominant categories are overlooked when they satisfied the constraint
. Figure 4 plots the cost function for
and y = 1, respectively. It is easy to see that our function is also smooth with continuous gradient.
4.1 NEURAL NETWORK WITH HAND-DESIGNED FEATURES
As shown in previous section, linear SVM cannot model the nonlinear relationship and does not work effectively in large scale. On the other hand, Nonlinear SVMs , are too expensive in large scale datasets, because empirically the number of support vectors grows with the number of training samples. Our first attempt is to employ multilayer perceptron (MLP) to replace SVM. Following the large amount of previous work, we employ a two layer MLP with hyperbolic tangent activation function as the classification model.
The input of MLP can be of many choices. Suppose the input is one pair of words represented by and
. The most straightforward representation is to consider the difference
, or use the absolute value in every dimension
. If we only use the difference as the input, the MLP in fact is a nonlinear generalization of Mahalanobis distance
.
However, the difference vector does not keep all the information for the synonym classi-fication. As suggested by Li et al. (2013), such models in fact impose a global decision threshold without considering the local characteristics of
or
. We propose to add more relationship and obtain more features for the pair, including
, together with
and
individually. Here
and
all denote element-wise operation. As shown in Figure 2 we can concatenate these hand designed vector into a long vector and build a neural network to learn the synonym. Suppose
and
are both of d dimensional, such a concatenated vector will be of 5d. The neural network is trained using stochastic gradient decent (SGD), where the gradient can be computed efficiently using back propagation.
Figure 2: MLP using hand-assigned features.
Table 5 compares the performance of individual features as well as the combined features. It is easy to see that by combining multiple hand-assigned features, we can greatly improve the performance of synonym extraction.
Table 5: Comparing different hand designed features with MLP
4.2 DEEP NEURAL NETWORK WITH FEATURES LEARNING
Neural network has been recognized by its potentials in learning features instead of using handcraft features. We have seen a number of successful examples including unsupervised learning (Coates et al., 2011) or supervised convolutional network (LeCun et al., 1998). In this paper, we propose an automatic approach to learn the feature representation instead of hand-assigned features . As shown in Figure 4.2, we first build a neural network which can take one pair of input features and learn a set of relationship automatically. These learned features are mapped into a nonlinear space and then used as the input of a multi-layer perceptron.
Figure 3: The framework of our synonym extraction system.
Our learned feature can be represented in the following form:
It is easy to see that Eq. (2) is a general form of all the hand-assigned features: when , this is a function of input difference
; when
, this becomes a function of the inner product as
.
Eq. (2) can be viewed as a special convolutional filter on the input
. In practice, our network use a three-dimensional tensor as the input. Dimensions of the tensor input are the batch size, the dimension of the distributed representation, and the size of input sample (as 3). Our algorithm is implementation based on Theano (Bergstra et al., 2010), where the layer definition is illustrated as Algorithm 1. To train the model, we use the standard stochastic gradient with a fixed learning rate 0.04.
Our feature learning based neural network achieves the best performance on our synonym recognition dataset. Table 6 compares the performance with hand-assigned features and SVM baseline (also using ). Our new approach significantly improves the F1 score of the baseline by 97%. Also it is interesting to find out that our feature learning approach outperforms the hand-assigned features in large scale, however, it does not show many advantages in the small scale dataset with 47K word pairs. These results suggest that not only powerful models but enough training samples are important in feature learning.
Table 6: Performance of our new approaches
This paper describe our efforts towards extracting synonyms in large scale settings. We employ the distribute representation to represent word pairs and propose a feature learning neural network with a new cost function to recognize synonyms. The experimental results on a dataset with 3.4 million synonym/non-synonym pairs show the feature learning approach outperforms the hand-assigned features. Our best performance improves the SVM baseline by relative 97%. Our future work is to improve the synonym extraction in different domains for QA systems.
Bengio, Yoshua, Ducharme, Rejean, Vincent, Pascal, and Janvin, Christian. A neural probabilistic language mode. The Journal of Machine Learning Research, 3:11371155, 2003.
Bergstra, James, Breuleux, Olivier, Bastien, Fr´ed´eric, Lamblin, Pascal, Pascanu, Razvan, Des- jardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. In SciPy, 2010.
Bernhard, Delphine. Query expansion based on pseudo relevance feedback from definition clusters. In COLING, pp. 54–62, 2010.
Coates, Adam, Ng, Andrew Y, and Lee, Honglak. An analysis of single-layer networks in unsuper- vised feature learning. In AISTATS, pp. 215–223, 2011.
Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: deep neural networks with multitask learning. In ICML, pp. 160–167, 2008.
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
Ferrucci, David, Brown, Eric, Chu-Carroll, Jennifer, Fan, James, Gondek, David, Kalyanpur, Aditya A, Lally, Adam, Murdock, J William, Nyberg, Eric, Prager, John, et al. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79, 2010.
Ferrucci, David A. Introduction to ”this is watson”. IBM Journal of Research and Development, 56 (3):1, 2012.
Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. Domain adaptation for large-scale sentiment classification: A deep learning approach. ICML, 2011.
Henriksson, Aron, Moen, Hans, Skeppstedt, Maria, Daudaravicius, Vidas, and Duneld, Martin. Syn- onym extraction and abbreviation expansion with ensembles of semantic spaces. J. Biomedical Semantics, 5:6, 2014.
Lebret, R. and Collobert, R. Word embeddings through Hellinger PCA. In EACL, 2014.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Li, Zhen, Chang, Shiyu, Liang, Feng, Huang, Thomas S., Cao, Liangliang, and Smith, John R. Learning locally-adaptive decision functions for person verification. In CVPR, pp. 3610–3617, 2013.
Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word repre- sentations in vector space. In ICLR, 2013a.
Mikolov, Tomas, tau Yih, Wen, and Zweig, Geoffrey. Linguistic regularities in continuous space word representations. NAACL HLT, 2013b.
Miller, George A. Wordnet: A lexical database for english. Communications of the ACM, 38(11): 39–41, 1995.
Mnih, A. and Teh, Y. W. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012.
Norlindh, Fredrik. Extraction of synonyms and semantically related words from chat logs. 2012.
Rumelhart, David E, Hintont, Geoffrey E, and Williams, Ronald J. Learning representations by backpropagating errors. Nature, 323(6088):533–536, 1986.
Socher, Richard, Lin, Cliff C., Ng, Andrew Y., and Manning, Christopher D. Parsing natural scenes and natural language with recursive neural networks. ICML, 2011.
Turney, Peter D. and Pantel, Patrick. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188, 2010.
Wang, Tong and Hirst, Graeme. Extracting synonyms from dictionary definitions. In RANLP, pp. 471–477, 2009.