Keywords are an important means to capture most signifcant information in a text document. Tey are used in information retrieval, text summarization, social media streaming, etc. Twiter ofers an Application Programming Interface (API) for streaming of tweets. Te API supports many options to flter tweets based on several streaming endpoints, such as language, location, users, and keywords. Filtering by keywords is a very popular and efcient way to capture tweets based on target interest and domain. With the keyword flter, a tweet is captured if it includes any keyword from the keyword list. Many companies use the names of the company or products to monitor customer feedback, provided customer support, conduct focus groups, and engage in similar activities. However,
this approach of collecting feedback has limitations. Since the keywords are selected in advance and then kept static for an extended period of time, a useful tweet containing valuable user feedback but not including any of the selected keywords cannot be captured. For example, consider a bank using its name as the keyword to monitor online feedback about their new credit cards. If a Twiter user posts an image of a credit card as a tweet and comments, “I dislike the rewards of this credit card,” then this tweet would not be captured, even though it contains very useful information. Such cases frequently occur in replies to posted tweets or comments of retweets because the company or product names might already be mentioned in the original tweet and are not captured by the keyword. If the set of keywords is expanded by including fcredit card,f then too many tweets would be captured. Tere is even a bigger fundamental faw. In the near future, there might be a relevant event or activity coming up leading to a new topic that would not be captured by prior keywords. Tis calls for having a prediction model for the forthcoming topics and dynamically adjusting the keywords to anticipate emerging topics. Indeed, this research results from a collaboration with a large global company having challenges updating keywords on a regular basis.
As input the problem consists of a set of documents with a temporal dimension in a corpus and a set of keywords used to collect each document. Each document includes at least one keyword. Te documents collected in a time window form a set of topics. Te problem is to select a set of new keywords that are going to be used to collect documents in the immediate future. Te future topics which have to be predicted from the keywords implied by the yet-to-be-collected documents should not deviate too much from the recent topics, but, on the other hand, should capture possible emerging topics.
Word frequency is a straight forward approach to extract new keywords. TFIDF is an early word extraction algorithm comparing word frequency in a document and in the corpus to determine the importance of a word. Tere are extensions to this notion based on TFIDF that apply extra features, such as the length of the words, however, tweets are short (maximum 280 character posts) and thus it is hard to fnd meaningful words that frequently appear in certain documents but have low frequency in others. With TFIDF, it is also easy to pick background words that seem useful but are too broad to be keywords, such as “today,f “money,f etc. With such broad words as keywords, it is easy to flter short tweets that include these words but are of-topic. Tus we need a diferent algorithm and model to select meaningful keywords applicable to short text media like Twiter. In addition, TFIDF and its derived algorithms select keywords based on historical data, so the selected keywords can only represent past information. However, new keywords need to be used in the future instead of being based on the past. To capture useful information, we need to select keywords that should encapsulate upcoming topics and trends. Terefore, a generative model is necessary for our problem in addition to an inference part that is going to predict upcoming topics based on sets of keywords. Unfortunately, TFIDF does not provide such a generative model.
To select meaningful keywords that can be used to capture future relevant topics, we propose a generative model with a topic model as an information flter where priors depend on candidate sets of keywords. Tere are some keyword extraction models based on topics, such as Topical PageRank [15] and context-sensitive Topical PageRank [28]. Tese models calculate word ranking scores using the graph-based Topical PageRank and extract new keywords from the word ranking scores. Tey build word graphs and use topics to measure the importance of the words. In such models topics are fxed, and the keywords within the chosen topics are subsequently selected. Since the topics are fxed, these models do not meet our requirement of anticipating and predicting future active keywords together with resulting topics. In our proposed model, both topics and keywords have equal importance and are jointly modeled with the generative model. Te generative model captures past interactions between the keywords and topics and thus in inference given a set of keywords it predicts the resulting topics. A challenge in inference is to balance the fact that future topics should be similar to the current topics, yet at the same time they should capture emerging topics.
Diferent from Topical PageRank and context-sensitive Topical PageRank, which are limited to pre-selected and fxed topics, we model topics by a topic model (Latent Dirichlet Allocation (LDA), [3]). Te topics are modeled in time and depending on the underlying keywords, and thus can refect events and trends, such as a new product release. We assume that the set of all documents is grouped into subsets (e.g. all documents collected in a week) where each subset of documents is generated based on its own set of keywords. Te keywords pertaining to the diferent subsets can overlap. We model the interaction between keywords and topics through priors at the document-topic and topic-word levels. Te keyword selection process has its own simple generative process whose output forms the input to neural networks that in turn output the priors to LDA. Te model has three prior parameters , and
which are all trainable. In each updating period (e.g. a week) corresponding to a subset of documents and the underlying keywords, the selected keywords are frst sampled from prior parameter
. Te generated keywords work together with prior parameter
(weights in a neural network) to generate the document-topic distribution
through a neural network, and with the other prior parameter
(also weights in a neural network) through another neural network to generate the topic-word distribution
indicating the word distribution for each topic. Regarding word generation in document d, frst latent topic variable
is selected based on the generated set of keywords and next the keyword belonging to this document is sampled from the topic-word distribution
(which also depends on the generated keywords through a neural network). Te later sampling within the topic selected is based on a discrete probability conditioned on topic
and the topic-word distribution. For each remaining word in the document, latent topic
is sampled from the topic-word distribution
, and a word is chosen from discrete probability conditioned on topic
(this step follows standard LDA). In this way, every word and keyword for each document are sampled. Te process guarantees that each generated document has at least one word from the generated set of keywords.
Te model includes the standard LDA parameters which are augmented by the parameters pertaining to the generative process behind keyword generation and it also includes neural network parameters. Te parameters are trained by a combination of expectation maximization (EM), variational inference, and stochastic gradient optimization. Since the generative process behind keywords and a part of the incumbent LDA adjusted model are using discrete distributions, we also resort to the Gumbel sofmax trick in training. In inference, we need to decide which keywords to use for immediate future streaming based on the generative model.
Te proposed model is evaluated on three sets of tweets with the benchmark algorithm being a sophisticated algorithm based on NLP techniques (also designed in this work). In terms of accuracy appropriately defned the improvement is 74%, 61% over the benchmark algorithm if 2 or 3 keywords are recommended, respectively. If a random subset of keywords is selected among all candidates words, then the improvement is 3-fold.
In this paper, we propose a novel generative probabilistic model for keyword selection, Keyword Latent Dirichlet Allocation (KLDA). Te main contributions of our work as as follows:
(1) A topic based generative model is proposed for the future document keyword selection.
(2) Neural networks are introduced in producing priors. (3) An inference model that recommends keywords to use in the future trading of the facts that future topics should be similar to the current topics, yet they should include novelty in order not to miss emerging topics. In Section 2 we review the literature while in Section 3 the problem is formally stated. In Section 4, we describe the details of KLDA, including the generative process and inference. In Section 5 we introduce the datasets and discuss all of the experimental results.
2.1 Keyword Extraction
Keywords are an efcient and common way of fltering streaming textual data (in particular on Twiter). Several prior works deal with keyword extraction. Turney [23] proposes a supervised learning approach for this problem, classifying phrases as positive and negative in each document. Many later algorithms have been developed based on the Turney’s approach, with additional features and data sources. Hulth [9] introduces part-of-speech (POS) tags to represent the data, Yih et al. [26] analyze information from many resources, including meta-tags, URL, and query frequency, and Liu et al. [13] utilize a rich set of features such as lexical. However, deciding the importance of a word based on its frequency does not capture the resulting topics. In addition, supervised learning methods require accurate learning labels which are usually not available. Te goal is to periodically set new keywords for future upcoming documents, so it is hard to get labels for a massive amount of documents. Besides, we do not have a clear target for what the keywords should be. Terefore, unsupervised learning is necessary in our keyword selection problem.
Graph-based keyword ranking is a popular and efective strategy to extract keywords. PageRank [19] is proposed by Page and Brin to estimate the importance of Web pages. Relying on PageRank, Mihalcea and Tarau [18] propose a keyword extraction method based on their TextPage model. TextPage is widely used in later keyword extraction algorithms [14, 24]. However, these graph-based models do not consider the semantics of keywords. Tus, Liu et al. and Zhao et al. [15, 28] propose topic based PageRank focusing on keyword extraction in pre-selected and fxed topics. However, we aim to select keywords without knowing the topics a priori and thus we rely on the concept of predicting future topics.
2.2 Topic Models
Topic models discover semantic relationships in documents by maintaining latent topics of documents. Latent Semantic Indexing [20] is an early topic model that maps words and documents to a representation in a semantic space using Singular Value Decomposition assuming that the words with the same semantics have similar meanings. Hofmann [8] propose a generative model Probabilistic Latent Semantic Indexing that contains a statistical foundation to build document indexing and information retrieval. It obtains the probabilities of diferent topics by generating every word from a topic and thus each document may contain multiple topics. Based on this, Blei et al. [3] introduce LDA to overcome overfting of Latent Semantic Indexing models. A survey of topic models can be found in [4].
LDA is a generative model for detecting topics. It represents each topic with the word distribution and models every document as a mixture of multiple latent topics. Many models consider other factors, and extend LDA in diferent dimensions. Blei et al. [2] consider topic evolution in sequential documents and extends the topic model to the dynamic version, and Hofman et al. [7] extrapolate LDA from local analysis to online learning in streams. In image problems, Wang et al. [25] introduce image information into LDA priors to encode the spatial structure among visual cues, Li et al. [5] extend LDA to natural scene categories, and Barnard et al. [1] also apply LDA parameters on images for matching words and pictures. Lin et al. and Mass et al. [12, 16] generalize LDA in sentiment analyses, and Zhang et al., Steyvers et al., and Rosen-Zvi et al. [21, 22, 27] add user information into LDA priors. Similar to these works, we introduce keywords into the LDA model by means of a fexible prior. Te approaches are fundamentally diferent because of the dependence of topics on the underlying keywords.
We frst present notation and the seting. We assume that, from a fnite vocabulary set (where V represents the cardinality of the vocabolary), we are given a subset
called the keyword set, which is the focus of this work. We use Q to denote the cardinality of Q. A document d is a set of words
is the size of the set,
V for all
, and there exists at least one
such that
. Note that (1) we require each document to contain at least one keyword; (2) there can be more than one keyword in a document; (3) we adopt the bag-of-words assumption throughout, thus the exact order of the words in the document is of no importance. A corpus W is a sequence of documents
where we assume that the index refects the temporal dimension, i.e.,
is the oldest document and
most recent document. We denote by K the number of topics which is a fxed hyperparameter.
As input (training data) we are given W together with Q. Te main goal of keyword selection is to fnd a subset such that yet-to-be-collected documents
keywords
have similar topics as the topics of the most recent documents in the corpus. “Similarity” is defned later in Section 4; it needs to capture the desire to have identical topics yet to also include some emerging relevant topics.
As an example with tweets, let us assume that at the beginning of each week we specify a set of keywords. During a week we collect all tweets with respect to these keywords. Training data consists of all past tweets collected together with the keywords used. Te problem is to select a set of keywords for the next week tweets. In this example, W corresponds to all historical tweets and Q can be the set of all keywords used in prior weeks possibly augmented by additional potential keywords. Te keywords for the next week are denoted by . Te intent is that the topics of the tweets in the next week would be similar to the topics in most recent weeks but they should also capture some emerging relevant topics.
We approach this problem from the perspective of topic modeling. Each week of tweets yields a set of topics where the document-topic and topic-word probabilities depend on keywords used. We model this dependency through priors of the underlying Dirichlet distributions. Tis yields a relationship between a potential set of keywords and topics for the next week.
We frst introduce the generative model and then the solution methodology based on variational inference.
3.1 Generative Model
We propose the KLDA model (illustrated in Figure 1 lef) to explain the generative process of a document (labeled with subscript d and containing N words) with keywords, the steps of which can be summarized as follows.
Notation “Dir,”“Multin” represent the Dirichlet and multinomial distributions, respectively. Te trainable parameters are . In the exposition for simplicity we assume that N is fxed, but everything can be easily extended to the case when N is distributed based on the Poisson distribution.
Step 1 determines which of the words in the candidate keyword set Q are actually used as keywords for the current document. Binary represents the inclusion of
Figure 1: (Lef) Plate representation of KLDA. (Right) Graphical model Representation for the variational distribution approximating posterior of KLDA. Subscript d is used for variables that are realized at document level, while w for variables generated for each observed word in the corpus.
the corresponding keyword in d (1 if the keyword is in d and 0 otherwise). Tese keywords become part of the document (step 2). Note that 0 which means that each document must contain at least one keyword. We assume an energy-based distribution for keywords, where the unnormalized probability is ˜
with c being a hyperparameter. Te partition function
is a summation over all possible combinations of ˜
. As the result, the normalized probability is
. In the energy function ˜
encourages the model to generate keywords that are aligned with
, which represents the underlying probability of each keyword while term
penalizes z if too many keywords are included, which provides a regularization and prohibits the generation of a large number of keywords for a single document.
Next in step 3 we generate a probability vector of topics based on the Dirichlet distribution. Its prior
is a model (parameterized by
) that takes as input a binary vector of dimensionQ, and outputs a vector with Q positive numbers that can be used as a prior for a Dirichlet distribution. In our model the prior depends on the topics generated
In the next step 4 we generate a set of parameters that govern the topic-word distribution of the selected topic
. Tese parameters are specifed by function
which is a model (parameterized by
) that takes as input a binary vector of dimension Q and outputs a
matrix with each row corresponding to a distribution that is used as a topic-word matrix in the LDA model.
In step 5, is the number of keywords selected. Step 6 generates all of the remaining keywords in d. In step 6a a topic is then selected based on the multinomial distribution with probabilities
A word is fnally sampled in step 6b based on a probability depending on the generated topic and topic-word priors, and added to the document in the last step. Distribution
is a multinomial distribution conditioned on the one-hot vector z, meaning that for index k with
1, the distribution is a multinomial distribution parametrized by the
Te generative process implies the joint probability distribution
One peculiarity of this generative process is that on the surface the topic-word matrix seem to depend on document d. However each document generated on the same set of keywords is subject to the same topic-word matrix. Tis is consistent with our goal that a set of keywords implies a topic-word matrix. On the other hand, it is consistent with the standard LDA where there is a single set of keywords and thus just one such matrix not depending on documents.
We describe the specifc choices of and the procedures for the updates of their parameters in the next section.
3.2 Variational Inference with EM Algorithm
Due to the intractability of the proposed probabilistic model, we resort to variational inference combined with the EM algorithm for model training. Te variational distribution for latent variables in KLDA is shown in Figure 1 right and it reads
In the study we assume that is the Bernoulli distribution with success rate
which agrees with the assumption that multiple keywords can be presented in a document. Distribution
of a given document d is multinomial with parameters
. Finally,
is Dirichlet with prior parameter
. For topic weights
and topic labels
, we adopt the approach used in the standard LDA model (see [3]). In the variational distribution,
is sampled from a Dirichlet distribution parametrized by the
th row of
, a
matrix with positive entries. Latent variable
is sampled from a multinomial distribution parametrized by a probability vector
of length K.
Te introduction of keywords entails a new approach on the variational distribution that is amenable to both the categorical nature of and the framework of the EM algorithm. Here we use the Gumbel-sofmax trick [10] to generate keywords
in the variational distribution. Specifcally, the trainable parameter
is a
matrix, where each element
represents the probability that the
th candidate keyword is used as a keyword for document d. Te Gumbel-sofmax trick is used to approximate the categorical distribution of
in the following way where
is 0 if x < r and 1 otherwise.
For a fxed temperature parameter 0 and for any given document index
, we sample
form distribution on [0, 1], and we set
for j = 1, . . . ,Q,k = 1, . . . ,Q to represent a standard Gumbel sample. We then compute
and collect the vector . We then use
is applied coordinate-wise) to approximate
. Note that if
are binary, then
. Te Gumbel-sofmax trick provides a diferen- tiable, smooth, empirically efcient and stable gradient estimation approach for categorical distributions, which can recover the original discrete distribution when the temperature is annealed down to 0. For training of KLDA, we use a sequence of pre-determined, decreasing temperatures. In each iteration we frst sample
the Gumbel distribution described above, and then feed
into f and
for evaluating the loss function or computing a gradient during training.
With the aforementioned variational distribution, we apply the EM algorithm to maximize the Evidence Lower BOund (ELBO) (see [7] for a detailed description based on the original LDA model). In our context, the probability of the corpus W in KLDA isthe joint distribution of the corpus with all latent variables is denoted by
, and the probability of latent variables under the variational distributions is represented as
Ten we have
where denotes the expectation operator with respect to variational distribution q. Note that the introduction of the Gumbelsofmax trick afects the variational distribution and requires sampling from the Gumbel distribution at each iteration, and for largescale cases the training of the model needs to be executed in an online fashion, using only a batch of the entire data in each iteration. Terefore, the actual training is performed based on an estimator ˆL of
depending on the batch sampled in each iteration, and the sample
through the Gumbel-sofmax trick.
In the remainder, we abuse the notation of q to denote the revised variational distribution where sampling of is performed through the Gumbel trick. First, note that
where is the expectation with respect to the random vectors
sampled from the Gumbel distribution parametrized by
we consider each terms one by one, conditioned on
where for hyperparameter h set to 0.5 in experiments. Te indicator functions in (4) and (5) are capturing the fact that if a word has already been generated as a keyword, we do not need to generate this word again as a regular word; we use hyperparameter h for
to determine if a word has been generated as a keyword. Note that this is an approximation only because we are using the Gumbel trick and thus values of
not strictly 0 or 1.
A direct calculation gives an explicit form for each term (2)-(5). For term (2), we have
where the unnormalized probability ˜and partition function Z are defned in the previous section. For term (3), we have
where being the parameter for the variational distribution, and
and denoting the Digamma function. For term (4) given word with index w in document d we have
with coming from the variational distribution. For term (5) we derive
where is the parameter for z in the variational distribution, and
is the output of model
Also, for the variational distribution we have
Specifcally, in term (6), given the original Bernoulli distribution in Figure 1 right, we have
for vectors ; for term (7) we derive
and for term (8) we have
We next consider the case where we can only access a batch ˜of the entire corpus. Denote the number of documents in ˜W as
, and the total number of words of the documents in this batch as
. Ten we can use the following estimator with
adjusted weights for each term:
At an E-step, the optimal solution with respect to and
can be solved through the iterative updates for any
and k = 1
Te derivation can be found in [3]. For all the other parameters, the updates can be conducted through gradient descent with respect to the estimated log probability ˆL. Note that can also be updated in this fashion since
is essentially a function of
(and random uniform samples u).
3.3 Partition Function
Updating of requires special care with respect to log
in (2). Te exponential nature of this term also requires approximations in inference. We resort to sampling described next.
For estimating the gradient we use the following well-known fact (see [6] for details)
During training, the gradient estimator is obtained through the Markov Chain Monte Carlo algorithm. Specifcally, we employ Gibbs sampling, and maintain diferent independent chains whose state spaces are
, the possible range of
. Before training starts, we initialize all the chains by uniformly randomly sampling a possible
. At iteration t of gradient optimization in the M-step, we perform
steps of Gibbs sampling for each chain, and denote the current state for each chain by ˜
for each
is a preset sequence of integers. Te gradient estimator we use to perform stochastic gradient descent on
, conditioned on
from the Gumbel trick, is
During inference with respect to the logterm, we employ a simple importance sampling algorithm. Alternative options of MCMC or acceptance-rejection are not appropriate for the following reasons. Since the estimate ˆZ is used in the log function as an approximation to log(Z), even though
, it holds that
due to the Jensen’s inequality, and the gap is afected by the variance of ˆZ and thus we need a lot of samples (we use 100,000). For MCMC running that many chains till burn-in is too costly.
To be specifc, importance sampling is used to estimate the partition function Z, where we construct the following estimator
Values are i.i.d samples from distribution
constructed in the following way.
(1) Sample ˜N from Geometric distribution with mean 1/0.25, and set to correspond to the number of keywords;
(2) Randomly sample a subset ˜(the sampling is without replacement);
(3) Output a vector 0 otherwise.
If the distribution of h is relatively close to the distribution of Z, the variance of the estimator should be low. Te designedh is trying to match the generative process. Even though we are ignoring the actual dependency on for each candidate keyword and use uniform distribution instead, the geometric distribution is capturing the fact that, due to the penalty term c > 0 in the defnition of ˜
it is approximately geometrically or exponentially unlikely to have many keywords.
Note that h(X) is simple to evaluate since it only relies on the number of positive elements in X, and only requires the summation of a geometric sequence. In our experiments, we set so that the variance of log
is small enough when plugging in the estimator ˆZ, which further ensures that the bias to log
is small enough.
3.4 Overall Training Algorithm
Te overall algorithm is presented in Algorithm 1. Te main loop in step 2 iterates over all EM iterations. In the inner loop 3 we frst create independent samples in support of keyword candidates based on Gibbs Sampling described in Section 3.3. Te actual samples are derived in step 6 for the select batch samples of documents. Te forward step is executed in step 7. Possible several E steps are done in step 8 while a single M-step gradient adjustment is performed in steps 11-15.
Afer training the KLDA model, we obtain model that can output the topic-word matrices given keyword set z, and vector
be used to evaluate the possibility that keyword set z is actually generated by the KLDA procedure. Te next goal is to decide the best possible keyword sets for further collection of documents. To this end, we propose two inference metrics that aim to capture two antagonistic purposes of deriving a keyword set. Let us recall that the documents are assumed to be collected in time (which is actually disregarded in training). Let
be the most recent set of keywords used to collect documents
for a hyperparameter s. In practice s should refect the frequency of keyword updates, e.g., if we plan to update the keywords once every week, then these documents should correspond to the documents in the last week and thus
are the keywords used in the previous week.
Given a subset of keywords, then
is a topic-word matrix where 1
is the binary representation of U . Let
/ K is a distribution over vocabulary (with
denoting the k-th row of matrix
. Let also freq(S) be the ratio of the number of documents in
containing every word in S and s (the number of all documents in
To be more specifc about the goal, assuming that a set is provided, the goal is to identify another subset
that, in terms of the possible documents generated by keywords contained in
, it maintains information in documents
corresponding to
but also captures new information and provides a natural extension of previous topics.
We construct by considering each word
and try to “extend” it with a diferent word
. Te set of all derived
form our fnal set
of keywords. Let
be fxed in the subsequent discussion. We consider the strength of
by using the following two metrics.
Te frst metric aims to capture that can be seen as a good extension of keyword
, since either of them generates documents that cover similar topics. We need to capture that as long as one of the keywords appears in the document it would be meaningful to collect it. Choices for such extensions
should satisfy the following criteria: (1)
should be relatively frequent in
(we set the threshold as
erwise the mixing of
would be extremely similar to using
only, thus rendering the next criterion meaningless; (2) the KL divergence between the word distributions induced by
and
should be small (
should be small where KL denotes KL divergence). In our experiments we found that, instead of using KL divergence, most other
divergence measures can be used and the selected keywords remain largely similar, so other divergence metrics can be used if need be. As long as the frst threshold is satisfed, we prefer the keywords that induce as small KL divergence as possible.
Besides this metric, we propose the second metric that is used as a reference and helps fltering out some obviously bad choices before applying the previous metric. We use the term high-frequency distance for the quantity defned as follows. Let us assume we have hyperparameters (high-frequency threshold),
(retention rate), and ˆc > 0 (penalty weight). We defne the following sets
and set the high-frequency distance as Intuitively, this quantity translates the idea of “maintaining old information but also extending to new information” into a value that is based on high-frequency words in both cases. In our experiments we set
to be a value that ensures that the high-frequency sets
are of size around 100 (the value is set as 1/1, 000), and we set ˆc = 1,r = 1/2. Note that if
, then the two sets are empty and thus
Te actual inference procedure is presented in Algorithm 2. Note that . It can conceptually happen that
is returned by the algorithm. In such a case which was never observed in the experimental study, we can select
5.1 Training Details
Related to (1), function s is not smooth and it requires samples. In order to circumvent this in the computational experiments we instead sample Q binary Gumble approximations as follows. For any given document index
, and keyword index
, we sample two independent copies
from the uniform distribution on [0, 1], and we set
1. We then compute
and defne the vector to approximate
this is smooth it has the drawback that it allows
0, i.e., no keyword is in document d. Tus this necessitates allowing state 0 in MCMC (see Section 3.3).
For our experiments, we set K = 5 topics (by fnding a good number of topics using plain LDA), and the batch size B = 64. For the distribution of , we set the penalty c = 2. For model
word matrices) we use a 2-layer neural network with 32 neurons inn each layer, using LeakyRelu activation with weight 0.01 for the negative quadrant (see [17] for details on LeakyRelu). For model f , we fx it and always use a uniform 1 vector as the output. In other words, in our experiments we use an uninformative, uniform prior for topic weights. We tried to use a parametric f but we observed that the trained model tends to use only one of the K topics and the others have an extremely small probability to be generated. We have tried to add a Lagrangian penalty term to prevent the topic weights from being too extreme but at no avail. For training of model
and gradient ascent on most parameters, we set the learning rates as 0.001 and use the Adam optimizer [11]. For updates of
, however, we use RMSprop (see [29]) with learning rate 0.005 to cope with the fact that the gradient on
is extremely sparse as only the
corresponding to a mini batch of documents is updated at each iteration. We also add an L2 regularization term to loss function (9) with the regularization penalty of 0.1.
For the MCMC steps, we set1 otherwise. In other words, periodically we perform more Gibbs sampling steps to ensure the quality of the samples afer updating
. For the Gumbel trick, we anneal down the temperature based on the following scheme
which is partially following the choices in [10] but we set the threshold as 0.25 as we try to further anneal down the temperature and beter recover the original categorical distribution.
Since the corpus is given, during training of KLDA, we cannot allow the model to generate a keyword that is not observed in the document. To this end, at each iteration afer we obtain through the Gumbel trick, we apply a binary mask to all the elements in
, which multiplies 0 to slots of candidates keywords that do not appear in the document. Terefore, the masked candidate keywords do not afect the evaluation of ˆL, and the corresponding
need to be updated.
Training of is critical to our inference, since it determines the process that generates keywords. In our experiment we apply several tricks to improve the quality of
. First, we change the weights of the negative MCMC samples for estimating the gradient of the partition function. In (10) we apply a weight
and use the revised form
In our experiment we set 1. Second, we add an
ization term in ˆL as the squared error between normalized exp
and the empirical probability of each candidate keyword in the corpus. We set the coefcient of this
regularization term as 1.
Lastly, to improve the quality of model , we adopt a pre-training scheme before training the KLDA model. Te only diference between pre-training and complete KLDA model training is in the following. In pre-training, there is no generation of keywords
instead we use the real observation as
, meaning that each element in
is 1 if the corresponding keyword appears in the document, and 0 otherwise (namely, we skip the stochastic generation of keywords, and deterministically generate a candidate keyword as a keyword if it is present in the document). We use the same hyperparameters and model structures as described above. Note that this pre-training step does not involve updates on
in fact we skip the keyword generation step), and only serves to improve the quality of topic-word matrices output by
. We run pretrain for 2, 500 iterations, and when we start training the full KLDA model, we load the parameters of model
from pretrain to initialize the new
5.2 Data
We next give a briefng of the data used in the experiments. In order to test the model’s capability under diferent setings, we use collections of Twiter posts that concern various topics and evolve over time. Specifcally, data collection is done through the Twiter streaming API; the entire collection lasted for 11 weeks in year 2019 for each dataset with keywords adapted at the beginning of each week. During this period we maintain 3 sets of tweets revolving around 3 diferent topics: horse-racing, streaming, and Korean pop music (the frst dataset was intended to be a “computer virus” related dataset, but evolved into a horse-racing dataset through the keyword selection benchmark procedure described next). We selected these topics to have variety.
Te benchmark algorithm relies on a prediction model that classifes if a tweet is likely to become viral or not. Te features of a tweet are the user identifcation, text length, number of URLs, number of mentions, number of hashtags and emojis, the number of users the user is following, the number of followers of the user, the number of favorites of the user, and the total number of posts of the user. Te label is defned as viral for each tweet that has been retweeted at least ten times in a month and the model is trained based on 3 months of tweets.
In terms of the data collection procedure for any dataset, in the frst week we manually pick a set of keywords that are related to these topics and, by querying the API, collect tweets posted in the following 7-day period that contain any of the given keywords. For all subsequent weeks we apply the following algorithm called viral to adjust the set of keywords. Te algorithm also serves as the main benchmark algorithm.
• We frst calibrate an LDA model on all tweets from the last week.
• We next perform inference of the tweets collected in the last week by using the viral prediction model.
• From all tweets predicted to become viral, we select 15 highest count words (excluding words that are not nouns or proper nouns which implies that we frst perform part-of-speech tagging) as keyword candidates.
• Let tweets(k) be the set of all tweets in the previous week containing keyword k where k is one of the candidate keywords based on the viral tweets.
• For each keyword k of the 15 keywords and for each tweet in tweets(k) we check if the tweet belongs to one of the topics. Tis is done by using the LDA probabilities. If over 60% of the tweets in tweets(k) belong to a topic, keyword k is selected and otherwise it is discarded.
Te ‘viral’ component produces candidate keywords present in tweets that are likely to become important. Te LDA part is necessary to assure that the selected keywords are not deviating too much from the desired topics. For example, in the case with the industry partner, the company had a promotion related to the FIFA World Cup. If the LDA part is removed, then the viral tweets capture everything related to the World Cup which is too broad and most of the material of no interest to the company.
We execute the algorithm for each of the remaining 9 weeks for each one of the 3 datasets. Te keywords produced and thus used in the collection process are exhibited in Table 1. We remark that based on this algorithm the initial topic of “computer virus” evolved to horse-racing in the frst dataset.
Next, based on the tweets collected in this 10-week span, we generate a set of words that are the union of the following: (1) recommendation from viral using the tweets in week 10; (2) high-frequency nouns in tweets not recommended by viral; (3) any candidate keyword that has been used in the 10-week data collection period. Tis candidate keyword set is used to collect tweets for one more week, labeled as week 11, and is utilized as the test set in our experiment. Table 2 lists the keywords for week 11 that are added to the keywords in Table 1 and Table 3 provides basic statistics of the datasets. KLDA for week 11 only recommends a subset of the keywords shown in Tables 1 and 2. At the end of week 10, we assemble the keywords for week 11 and then collect all tweets during week 11 based on all these keywords.
Table 1: Evolution of Keywords, Weeks 1-10
Table 2: Additional Keywords for Week 11
Table 3: Information of Datasets Used in Experiments
We add a few remarks regarding collection and preprocessing of the datasets. First, we apply stemming and removing of the stopwords. Besides, we reduce the vocabulary size by replacing all the low-frequency tokens (appearing less than 10 times in all tweets collected during the frst 10 week) by a placeholder token. Te purpose is to facilitate model training by preventing rare typos and acronyms, since tweets are more prone to these issues. As we observe in the collected datasets, less than 10% of tokens (words) have appeared 10 or more times in the entire dataset. By substituting the low-frequency tokens with the placeholder token, we manage to reduce the vocabulary to a reasonable size (see Table 3 for details). Lastly, if a token outside of the vocabulary is observed in the tweets collected in week 11, it is also replaced by the placeholder token, thus maintaining a consistent vocabulary between weeks 1-10 and week 11.
5.3 Computational Results
We outline the objective of the computational experiments as follows: for each dataset, diferent methods yield diferent sets of recommendations of keywords for week 11; by collecting tweets using candidate keywords recommended by diferent methods at week 11, we are able to evaluate the performance of diferent methods based on tweets in week 11 (given a set of keywords for week 11, we simply select the tweets containing at least one keyword from the set; since such keywords are a subset of all keywords used in the collection process for week 11, no tweet is lef out); the comparison of diferent methods under various topics also demonstrate the versatility and stability of diferent approaches.
More specifcally, we compare KLDA against viral and a randomly chosen keyword set. Recall that the inference algorithm based on KLDA generates recommendations for a given keyword by evaluating all the remaining candidate keywords under the aforementioned metric. Based on the score of the metric, we need to specify a threshold: candidate keywords with metric scores meeting the threshold shall be accepted into the recommendation list. In this experiment we compare the results from two diferent thresholds: we either accept the top 2 words or the top 3 words. Troughout, we use top 2 or top 3 to refer to the two diferent set of criteria, respectively. As for viral, recall that the recommendation is generated for the entire set used in week 10 in one shot, instead of extending each keyword in the set as is the case in Algorithm 2. Lastly, to generate a random keyword set as another benchmark,
Table 4: Perplexity of Trained Models
Table 5: Recommendation of Keywords from KLDA Model: Individual Keyword Level
we randomly choose 3 keywords used in the week 11 candidate set, with the restriction that only those that are not recommended by Algorithm 2 or viral are taken into consideration. Te random set generated for the experiment is: Horse-racing: town, trojan, know; Streaming: workshop, everyone, hero; KPOP: know, everyone, luv.
Table 6: Performance of Diferent Recommendations
We frst demonstrate the predictability of the trained model in terms of topics of future tweets, thus showing KLDA’s capability for this modelling problem. Specifcally, we frst train a model using tweets collected from weeks 1 to 9; for tweets in week 10 we train a separate LDA model; we then compare the perplexity of the two models when fting the tweets in week 10. As shown in Table 4, we see a consistent patern across all the three datasets: by training a model using tweets from week 1-9, we are able to ft tweets in the next week with very good perplexity, which is both a signifcant improvement compared to “no training at all” (a model that is randomly initialized and not trained at all), and close to the performance of the LDA model trained on week 10 directly.
As mentioned before, the KLDA inference Algorithm 2 generates recommendations by inspecting each keyword used in week 10. We next discuss the performance of KLDA at the individual keyword level, the result of which is illustrated in Table 5. Te metrics accuracy and coverage are interpreted as follows. Calling the top 2 (or 3) recommendations for week 11 based on the weeks 1-10 model as predictions, and the top 2 (or 3) choices in week 11 based on the two criteria discussed in Section 4 with respect to LDA based on tweets in week 11 as truth, we claim that a prediction generated by KLDA is accurate if it is in truth, and we say one truth is covered by KLDA if it is in the prediction set. We use “none” to indicate that due to bad performance of the high-frequency distance defned in Section 4, all candidate keywords are excluded and there is no good recommendation based on the tweets in week 11. We observe that KLDA demonstrates great accuracy when accepting top 2 choices, and the result further improves when we are allowed to accept top 3 choices.
Te comparison of the recommendations for week 11 under the diferent methods is summarized in Table 6. We observe that under both top 2 and top 3, KLDA yields the highest average accuracy; besides, KLDA also performs the best for 2 of the 3 topics. In the KPOP dataset, which is the only case KLDA is outperformed by viral, note that it is expected that viral outperforms since it generates a recommendation set of a much larger size, and even so it produces only one accurate prediction out of 11. In general, it is expected viral to have higher coverage since it selects many more keywords.
[1] Kobus Barnard, Pinar Duygulu, David Forsyth, Nando de Freitas, David M Blei, and Michael I Jordan. Matching words and pictures. Journal of machine learning research, 3(Feb):1107–1135, 2003.
[2] David M Blei and John D Laferty. Dynamic topic models. In Proceedings of the 23rd international conference on machine learning, pages 113–120. ACM, 2006.
[3] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of machine learning research, 3(Jan):993–1022, 2003.
[4] Ali Daud, Juanzi Li, Lizhu Zhou, and Faqir Muhammad. Knowledge discovery through directed probabilistic topic models: a survey. Frontiers of computer science in China, 4(2):280–301, 2010.
[5] Li Fei-Fei and Pietro Perona. A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the computer vision and patern recognition conference, volume 2, pages 524–531. IEEE, 2005.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
[7] Mathew Hofman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in neural information processing systems, pages 856–864, 2010.
[8] Tomas Hofmann. Probabilistic latent semantic indexing. In ACM special interest group on information retrieval forum, volume 51, pages 211–218. ACM, 2017.
[9] Anete Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on empirical methods in natural language processing, pages 216–223. ACL, 2003.
[10] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-sofmax. arXiv preprint arXiv:1611.01144, 2016.
[11] Diederik P Kingma and Jimmy Ba. ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on information and knowledge management, pages 375–384. ACM, 2009.
[13] Fei Liu, Feifan Liu, and Yang Liu. Automatic keyword extraction for the meeting corpus using supervised approach and bigram expansion. In Spoken language technology workshop, pages 181–184. IEEE, 2008.
[14] Feifan Liu, Deana Pennell, Fei Liu, and Yang Liu. Unsupervised approaches for automatic keyword extraction using meeting transcripts. In Proceedings of human language technologies: Te 2009 annual conference of the North American chapter of the ACL, pages 620–628. ACL, 2009.
[15] Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 366–376. ACL, 2010.
[16] Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Pots. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the ACL: Human language technologies-volume 1, pages 142–150. ACL, 2011.
[17] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifer nonlinearities improve neural network acoustic models. In Proceedings of the international conference on machine learning, volume 30, page 3, 2013.
[18] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, 2004.
[19] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. Te pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[20] Christos H Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, and Santosh Vempala. Latent semantic indexing: A probabilistic analysis. Journal of computer and system sciences, 61(2):217–235, 2000.
[21] Michal Rosen-Zvi, Tomas Grifths, Mark Steyvers, and Padhraic Smyth. Te author-topic model for authors and documents. In Proceedings of the 20th conference on uncertainty in artifcial intelligence, pages 487–494. AUAI Press, 2004.
[22] Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Tomas Grifths. Probabilistic author-topic models for information discovery. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, pages 306–315. ACM, 2004.
[23] Peter D Turney. Learning algorithms for keyphrase extraction. Information retrieval, 2(4):303–336, 2000.
[24] Xiaojun Wan, Jianwu Yang, and Jianguo Xiao. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In Proceedings of the 45th annual meeting of the ACL, pages 552–559, 2007.
[25] Xiaogang Wang and Eric Grimson. Spatial latent Dirichlet allocation. In Advances in neural information processing systems, pages 1577–1584, 2008.
[26] Wen-tau Yih, Joshua Goodman, and Vitor R Carvalho. Finding advertising keywords on web pages. In Proceedings of the 15th international conference on World Wide Web, pages 213–222. ACM, 2006.
[27] Wei Zhang and Jianyong Wang. Prior-based dual additive latent Dirichlet allocation for user-item connected documents. In Proceedings of the international joint conference on artifcial intelligence, volume 15, pages 1405–1411, 2015.
[28] Wayne Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achananuparp, EePeng Lim, and Xiaoming Li. Topical keyphrase extraction from Twiter. In Proceedings of the 49th annual meeting of the ACL: Human language technologies-volume 1, pages 379–388. ACL, 2011.
[29] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufcient condition for convergences of ADAM and RMSprop. In Proceedings of the computer vision and patern recognition conference, pages 11127–11135, 2019.