Diversified Hidden Markov Models for Sequential Labeling

2019·arXiv

Abstract

1 INTRODUCTION

SEQUENTIAL labeling is an important meta-problem inmany real world applications, including natural language processing (NLP) tasks [38] [37], video analysis [40] [33], protein secondary structure [27] [7]. It has received considerable attentions in the past years. One of the fundamental models for sequential labeling is HMM, which assumes a chain of discrete-valued latent states and each of them depends only on the immediate neighboring states. Conditioning on this latent chain, the observations are probabilistically independent. Take PoS tagging task from NLP as an example, speech tags (NNS-Noun, plural; MD-Modal; etc) are discrete-valued latent state, while words (directors; are; etc) are observations.

However, the parameter-learning task in the classical HMM implemented with expectation-maximization (EM) algorithm performs unsatisfactorily in unsupervised setting for sequential labeling [23]. A key reason for this drawback is the well-known fact that maximum likelihood estimation (MLE) with mixture parameters has the tendency to converge towards a singular estimate at the boundary of the parameter space [13] [9], no matter how the observations are actually distributed (e.g. normally distributed or askew distributed). With improperly estimated parameters, the performance of inference of the latent states can be severely unsatisfied. Besides, the identifiability of parameters is another issue for HMM parameter learning with MLE implementation.

Therefore, a penalized MLE with properly chosen prior distribution over parameters is essential for the practical applications of HMM. For example, smoothing penalty [50] and sparse penalty [47] [32] [52] [34] are two popular priors over either transition distribution parameters or emission distribution parameters for sequential labeling.

Different from the early works, we have explored the usage of diversity prior over a joint distribution of rows of HMMs transition matrix, in order to make these transition distributions more distinct when decoding the sequential latent states. In the cases, where the transition probabilities become similar, an HMM model approaches to a static mixture model. We can We can understand this intuition by considering an extreme case where each rows of a transition probability matrix are identical. This leads to the same state transitions regardless of the state we are currently in. Suppose we have a k-state HMM, parameterised by , i.e., the initial probability , transition probability matrix A and emission probability B, if the rows of A are identical and given by vector a, the joint probability of the hidden states and observations over a sequence of length T can be calculated as:

P(X, Y |π, A, B

It can be seen that the joint probability becomes an independent product of variables at individual time point, and thus the HMM model becomes a static mixture model, i.e., the data become exchangeable. In contrast, a prior that encouraging diversity is able to reserve the dynamics property of HMMs. To the best of our knowledge, this is the first paper to apply diversity prior over HMM parameters. We do so by incorporating a recently introduced Determinantal Point Processes (DPP) [30] methodology, which essentially defines a Probability Mass Function (PMF) that assigns higher probability to a diverse subset of data. Inspired by the work of [53], we propose a diversified HMM (dHMM) by extending the basic HMM with determinantal-driven diversity.

Specifically, our contributes can be summarized in the following:

• We extend the HMM to dHMM by incorporating a diversity-encouraging prior over transition distributions, with which we intend to mitigate the problem of singular estimate in HMM.

• The use of a prior does not change the E-step of the EM algorithm in a fundamental way. However, for the M-step of the estimation of dHMM parameters, we derive a new formulation to incorporate the continuous DPP prior over the transition probabilities.

• We demonstrate the effectiveness of dHMM under both the unsupervised and supervised settings by applying dHMM to benchmark sequential labeling problems, including: part-of-speech tagging (PoS tagging) and optical character recognition (OCR).

The rest of this paper is organized as follows. Section 2 reviews related literatures on both progresses of statistical HMM model for sequential labeling and recent development on DPP. Section 3 introduces our proposed model in detail. We briefly review both continuous Determinantal Point Processes (DPP) based on probability kernel and basic Hidden Markov Models (HMM). How diversity-encouraging prior is encoded into HMM and how the induced Maximum A Posterior (MAP) objective problem is solved are also presented in details. Both simulated and real-world experiments are conducted in section 4. Finally, conclusion is discussed in Section 5.

2 RELATED WORK

Our work utilized two fundamental machine learning frameworks: HMM and DPP. On one hand, HMM is a fundamental statistic model for modeling sequential dataset and of significant importance in many other fields, from speech recognition [42], handwriting recognition [21], video analysis [33], gesture recognition [47], gene sequence prediction [27], to optical character recognition (OCR) [12] and part-of-speech tagging (PoS tagging)) [14]. On the other hand, DPP is a probabilistic model of repulsion in quantum physics area and recently introduced in machine learning area. Since it has closed-form solution for inference, recently, lots of its extensions are developed. This section presents a review of previous work in both DPP and HMM.

2.1 HMM extensions with priors

We briefly review the development of HMM for sequential labeling. It is well known that the HMM parameters contain three parts: (1) initial state distribution, (2) one transition distributions and (3) parameters associated with emission distributions, discrete or continuous. Various extensions of HMM have been proposed by incorporating proper prior information onto either one part or all three parts of these parameters.

Supervised sparse HMM [47] was proposed to improve the expressive power for sequential surgical gesture classi-fication and skill evaluation. It assumes that the emission distributions sparsely and linearly constitute elements from dictionary of basic surgical motions, no matter the observations are discrete, Gaussian or factor analyzed. Training dataset is needed for dictionary learning for each gesture together with an HMM grammar describing the transitions among different gestures. With learned dictionaries and grammar, the testing motion data is represented and clas-sified.

Supervised large margin continuous density HMM (CDHMM) for automatic speech recognition was proposed by Sha and Saul in [43]. The real-valued observations (such as acoustic feature vectors) are modeled through Gaussian mixture models. Inspired by support vector machines, margin maximization is applied as training objective function which is defined over a parameter space of positive semidefinite matrices. This optimization problem can be solved efficiently with simple gradient-based methods.

Unsupervised learning is a more difficult but important problem, as it eliminates the need for expensive manual annotation. It was demonstrated from the work of [50] that, smoothing HMM parameters can achieve significant improvements for PoS tagging. Two strategies have been applied: the first one is to smooth the emission distributions by computing observed word similarities. The second one is to specify a stationary distribution for hidden states to constrain the transition distributions.

Unsupervised sparse HMM based on Bayesian framework also has been explored in several literatures. Unlike supervised sparse HMM [47], the emission distributions are learned from sparse representation, unsupervised sparse models add priors on transition distributions. [8] introduces a negative Dirichlet prior on the transition distribution, which strongly encourages sparseness of the model. Then, maximum a posteriori (MAP) probability estimation of HMM parameters is devised under a modified Expectation Maximization algorithm. Manuele et al. evaluate the proposed technique on a 2D shape classification task. In [19], for PoS tagging, rather than performing MAP parameters estimation first then followed by inferring hidden states, Goldwater and Griffiths directly identify a distribution over latent variables, without ever fixing particular values for the model parameters. This is achieved by integrating over all possible values of parameters under a Bayesian approach. The integrating over parameters space permits the usage of linguistic appropriate priors. For example, the symmetric Dirichlet prior may prefer equally, uniform or sparse multinomial distributions according to different settings of its hyper-parameters.

There exist other important extensions of HMM for determining the number of hidden states which is a key parameter in all clustering tasks. Non-parametric bayesian method is one popular solution for this problem. For instance, Guojun Qi et al. [41] applied hierarchical dirichlet processes prior over transition matrix to model the number evolution of the dynamic community structures. Here we fix this parameter from experience and we refer to [14] [49] for interested readers.

From these previous works, proper prior information [11] encoded into HMM leads to visible performance increment. Either smooth prior or sparse prior is somewhat reasonable from technical view. From intuitive view, different states would have different transition distributions, otherwise, HMM will finally fall to a ’static’ mixture model. Diversity-encouraging prior is reasonable and also natural to many real-world sequential applications. In next subsection, we review DPP, which is an elegant statistical models of diversity, and its recent developments.

2.2 DPP models of diversity

Determinantal Point Processes (DPP) provide a probability measure over every configuration of subsets on data points. Using data’s similarity matrix and a determinant operator, DPP assigns higher probabilities to those subsets with dissimilar items [31]. It coincides with the phenomenon which naturally arises in physics (fermions, eigenvalues of random matrices) and in combinatorics (non-intersecting paths, random spanning trees) [20] and is used to capture the repulsion [6] among particles. One advantage of DPP is that the inference of DPP can be solved in polynomial time, which is required by many real-time large-scale applications. We focus on the prior modeling power of DPP and here only name a few recent developments of DPP for interested readers. For DPP (both discrete and continuous models [4]), basic inference algorithms, e.g. marginal inference, conditional inference and sampling [31], Maximize A Posterior (MAP, [17]), are well-developed. Parameter learning of DPP kernels have been recently addressed by [1] and [15]. The technique on measuring the repulsion of DPP was surfaced recently [39].

James Y. Zou [53] is the first literature that introduced the DPP prior into generative latent Dirichlet allocation (LDA) model. The intuition is that co-occurring words only appear in a small number of topics. Using a positive definite kernel function, it specifies a preference for diversity over the joint word-topic distributions.

In addition to the DPP prior for LDA [53], three more DPP-prior based Bayesian models have been developed. [44] developed a Determinantal Clustering Process (DCP) by using the DPP to define probability measure over point set. Since the diversity prior is placed over all possible partitions of the data, DCP is a nonparametric Bayesian approach, which is useful for semi-supervised clustering task.

Applying DPP as spike-and-slab prior is useful in the context of variable selection under the Bayesian framework. By making use of the repulsion property of DPP, [25] improved the prediction accuracy of linear regression through the collinear predictors to be less likely selected simultaneously.

J. Snoek and R. P. Adams [45] tried to model temporal sequences of spikes to reveal the complexities underlying a series of recorded action potentials. In this paper, DPP models the hidden sequential neurons to capture and visualize the complex inhibitory and competitive relationships.

3 DIVERSIFIED HIDDEN MARKOV MODELS

In this section, our proposed dHMM and its MAP solution are presented in details. The graphical model of the proposed dHMM is illustrated in Fig. (1). In order to make this paper self-contained, we illustrate all of its steps as well as the background knowledge leading up to our new work: (1) We first briefly review the basic models: DPP and HMM. (2) Then, the probability product kernel is introduced as a basic building block for DPP. (3) Our dHMM is subsequently represented. (4) Finally, we detail the inference steps to solve the proposed dHMM.

3.1 Continuous determinantal point processes

Determinantal Point Process (DPP) is one popular approach to assign Probability Mass Function (PMF) to each subset of an arbitrary dataset, in either discrete sets space [30] or continuous sets space [3] [2]. By defining a pairwise similarity between the data elements, usually in terms of a Kernel function, DPP assigns higher probability to dissimilar subsets in terms of the determinant of the kernel matrix restricted to the selected subsets. Equally, DPP prefers diverse subsets.

Formally, given a base set in a continuous space, and a positive semidefinite kernel function ,

where is the sub-matrix of K with the restriction to the entries indexed by the elements in are eigenvalues of the kernel K. Let be the mapping function: , the kernel function can be explicitly expressed as:

where is the inner product in Hilbert space. The DPP is geometrically denoted as:

It can be seen that the probability defined by DPP relates to the squared |Y |-dimensional volume of the parallelepiped spanned by the selected items in the associated Hilbert space of K. It prefers diverse subsets, because their feature vectors in the Hilbert space are more orthogonal, and hence span larger volumes.

When the cardinality of a diverse subset is fixed to k, which is required by many applications, it is referred as k-DPP [29]. The corresponding probability density is given by:

where and is the kth elementary symmetric polynomial [29].

DPP has closed-form solutions for normalization and marginalization, which makes inference efficient. Dual representation and other efficient approximation methods for large-scale problems have been also developed. Interested readers should refer to [28] [5] [24] [46] [16] [18] for more details.

3.2 Kernels for probability diversity

We prepare a probability kernel, which allows us to apply DPP on HMM’s transition probabilities. In this work, the Probability Product Kernel, which is proposed in [22], de-fines the kernel function between distributions and of discrete variables, which are parameterized by respectively:

for and x runs through all possible values of discrete variable X. The kernel is computed by summing up the products between the two distributions in terms of x.

For distributions and , the less ‘correlation’ between them, the more ‘diversity’ we can gain through the determinant of the kernels. To remove the scale effects of different probabilistic measurements, the normalized correlation kernel function is applied:

The final continuous DPP kernel as a building block in our proposed model is: , where is matrix, and with for probability measure.

3.3 Log-likelihood function of Hidden Markov Models

Hidden Markov models assume what are being observed are generated by a Markov process with unobserved hidden states. It is especially known for their applications in temporal sequential pattern recognition.

In Fig. (1), a hidden Markov model is a k-state Markov chain observed at discrete time points t = 1, 2, ..., T. Let be the finite state space. One state can be transfered to all other states with probability distributions parameterized by transition matrix . We use as state variables, and means HMM is staying on state at time step denotes the transition probability from to , which is equal to . In unsupervised setting, hidden variables cannot be observed directly, which are represented by hollow circles in Fig. (1). In contrast, filled circles denote observations . Each chain observation is parameterized by i.i.d emission distribution B given hidden states. For each hidden state, its probability distribution is only dependent on its former state in the first-order HMM. The joint probability distribution over hidden variables and observations is parameterized by [42]. The likelihood is as follows.

The linear constraints are required by discrete probability measure. The last statement above means that parameters B of emission distributions should also satisfy the requirement of probability measure in either continuous or discrete space, decided by various applications. Since supervised HMM is a special case of unsupervised HMM with known hidden state for training period and known parameters for test period, here we just demonstrate the solutions for unsupervised case. Three basic problems of HMM are identified in [42], namely, adjusting parameters given observations for unsupervised setting (or for supervised setting), computing log likelihood for unsupervised setting and inferring hidden states given both parameters and observations for both unsupervised setting and supervised setting during test period. For unsupervised learning, these three problems are closely associated with the likelihood, which is computed by marginalizing out the hidden variables from the joint distribution. The log likelihood for one sequential observation is:

With Markov assumption of HMM and Jensen’s inequality, the lower bound of the intractable formula in Eq.(3) is:

where and are marginally unary and pairwise distributions of hidden variables.

Traditional HMM is solved under EM framework. As noted, it usually produces flatten emission distributions and meaningless transition matrix. In the next subsection, we detail how to encode the diversity-encouraging prior into the transition distributions and how to solve the three basic problems identified by the traditional HMM.

3.4 Proposed Diversified HMM

With all the previous concepts in hand, we can now proceed with our proposed diversified HMM (dHMM).

The HMM’s transition distributions obey multinomial distribution parameterized by , where t is the time index and k is the number of hidden states. The corresponding normalized correlation kernel function for rows of A based on Eq.(2) is as:

And the corresponding diversity prior of transition parameter matrix of HMM modeled by Determinantal Point Processes (DPP) based on Eq.(1) is:

where is kernel matrix, with 1. A is a k-size subset from the simplex. symbols kDPP. For all experiments based on our proposed dHMM, we set .

The graphical model of our proposed model is illustrated in Fig. (1). The bottom chain structure is a standard first-order HMM. Applying conventional symbols of graphical models, hollow circles indicate hidden states, while filled circles are symbols of observations. Similar to [53], we draw a double-struck plate to denote the DPP prior placed on the state transition matrix A. Higher the probability of DPP is, the more diverse of the rows of an HMM’s transition matrix is.

3.4.1 Unsupervised setting

To model the unsupervised sequential labeling, Maximum A Posterior (MAP) problem needs to be solved since it incorporates diversity-encouraging prior over parameters of rows the transition matrix. The new objective function is formulated as:

where is the parameters of our proposed dHMM and we adopt the same symbols with the traditional HMM. Note that we ignore the normalization constant of the DPP prior distribution, since it is irrelated to measuring the diversity of parameters of rows of transition matrix, as well as estimating parameters of initial distribution and emission distributions. And is used to balance the weights between measurements of likelihood and diversity-encouraging prior. When , no diversity-encouraging prior will distract the estimation of transition matrix from Maximum Likelihood Estimation (MLE) learning. With goes up, the weight of diversity-encouraging prior increases, and the diversity-encouraging prior will dominate the estimation of the parameters of transition matrix.

3.4.2 Supervised setting

For modeling supervised sequential labeling, as the hidden states are given during training period, parameters can be learned in a count manner. Specifically, is the ratio between the frequency of state and the total number of sequences. is the proportion of the pairwise states among all pairwise states appearing in the training sequences. B can be learned in a discriminative manner, since the observations are independent given hidden states. Obviously, the learned parameters fit the training dataset best, rather than the test dataset. To generalize the counting-computed parameters of transition matrix by incorporating diversity-encouraging prior, we construct the new objective function as below:

where is the trained parameters by with is used to control how far the final A can drift from .

3.5 Solutions

In this subsection, we mainly focus on how to learn parameters from unsupervised setting with objective function Eq.(7) and supervised setting with objective function Eq.(8).

3.5.1 Unsupervised setting

Traditionally, Expectation-Maximization (EM) framework [10] is applied to learn HMM parameters. Here, our procedure is only different with traditional EM in M-step. This is because in a MAP setting, the diversity-encouraging prior term, i.e., is irrelated to the hidden states X, which can be taken out of the integration in E-step. E-step: Given old parameters , we apply

forward-backward algorithm to do inference for hidden

variables. In the forward pass of HMM chain, it inductively sum-

marizes all information before time step t into marginal

distribution over each hidden variable and all past

observation variables , namely,

Similarly, in the backward pass, it summaries information over all future observation variables after time step t, , namely,

The initializations for both forward and backward pass are:

The conditionally marginal probabilities for hidden variables (required by likelihood in Eq.(4)) and likelihood can be computed by combining the forward and backward

Fig. 1: Graphical model of diversified HMM

summarizations. The unary and pairwise hidden states distributions as well as the normalization are formulated as:

M-step: In this step, dHMM optimizes the below objective function to update the parameters given N training sequences.

where, denote hidden state and observation of the nth sequence at time step t respectively, denotes the length of the n-th sample sequence. Note that, the last term of Eq.(4), which is the entropy of marginal distribution q(X), is irrelated to model parameters , is simply ignored in M-step.

As both log function and log det function are concave, we directly apply the Lagrange multipliers method to solve the maximization problem in Eq.(7). The Lagrange function is given below:

where is the Lagrange multipliers.

traditional HMM, we just list the results. For ,

either Gaussian distribution or multinomial distribution. For Gaussian distribution, {1, 2, ..., k}, the updating parameters for B are:

For multinomial distribution, , and the updating parameters for B are:

where is the Dirac delta function. The same results are applied when the emission distributions obey the Bayes Bernoulli distribution since Bernoulli distribution is a special case of multinomial distribution. Namely, with and parameters .

When handling with transition matrix A, the situations for traditional HMM and dHMM are different. Let be the Lagrange function related to parameters A, namely,

puted by

When , the updates for A are the same with traditional HMM:

When , the solution for Eq.(14) has no closed form. We iteratively maximize Eq.(13) with projected gradient ascend method. First the gradients are computed by:

dates for A are

where is the step size, and here we apply adaptive step in our implementation.

Then we project all rows of A onto the probability simplex by finding a nearest point in the simplex for , equally, we try to solve the following optimization problem:

We refer readers to Algorithm 1 in [51] for more details.

The overall procedure for updating transition parameter matrix is summarized in Algorithm 1.

The rows of input is initialized by samples from Dirichlet distribution and as shown, our stop criterion is based on the likelihood contributed by parameters A. The most time-consuming step is to compute the gradients, which are obtained by matrix inversion operation. Fortunately, we usually maintain a small transition matrix to be manipulated.

3.5.2 Supervised setting

To solve the optimization problem in Eq.(8), again the projected gradient ascend method is applied. And the gradients with regard to are computed as:

From above, the pairwise hidden states are counted rather than inferred in the supervised setting. is the same as in the unsupervised setting. Again, we iteratively update A with the initialization by gradient ascend method until converged.

Finally, we apply Viterbi algorithm to find the most likely hidden state sequences by solving the problem for unlabeled sequential observations under both unsupervised and supervised settings.

3.5.3 Convergence analysis

In the unsupervised setting, we maximize which is lower bounded by [9]. The E-step is the same with the general EM algorithm for HMMs. With fixed , in E-step, the exact posterior distributions of hidden states are derived by maximizing the lower bound of the likelihood , i.e., , where is the optimal posterior distributions. In M-step, is maximized by using the gradient ascend algorithm, i.e., , where corresponds to the local maximum of the objective function. Therefore, we can conclude that the EM optimization produces a sequence of objective value that converges to a local maximum.

Similarly, in the supervised setting, the sequence of the objective value produced by the gradient ascend method will also converge to a local maximum.

4 EXPERIMENTS

In this section, we demonstrate the effectiveness of our proposed diversified HMM (dHMM) by conducting experiments on both simulated and real-world datasets.

4.1 Toy experiment

For simulated dataset, {1, 2, 3, 4, 5} as state space, where the cardinality of the state space is k = 5. For the ground-truth initial hidden state distribution, it is set as 0.0912, 0.2421, 0.0652, 0.5914). The transition matrix A is shown in the first column of Fig. (2a). The emission

Fig. 2: Parameters of ground-truth, learned by proposed dHMM and by traditional HMM

probabilities are chosen to be single mode Gaussian distributions. The parameters of means and variances for k Gaussian distributions are set as , and .

300 observation sequences were randomly generated from the ground-truth parameters above. For simplicity, we equally set length of all sequences as six, namely, . The EM framework represented in Sec 3.5 was applied to learn parameters for both the traditional HMM and dHMM. The parameters of mean and variance of the emission distributions were initialized with samples from Gaussian distribution and Gamma distribution respectively. Initial state distribution and rows of transition matrix were sampled from a Dirichlet distribution , where the concentration parameters are set as . For diversified HMM, the balance parameter is set as . The learned parameters are shown in Fig.(2).

4.1.1 dHMM vs. HMM on Toy dataset

Since no label information for the learned parameters, alignment between learned parameters and the ground-truth is applied for visualization of the intuitive comparison. The rows of learned transition matrix are aligned by minimizing the distance between the learned transition matrix and the ground-truth transition matrix. The learned initial distribution and emission distributions are also aligned with the ground-truth ones accordingly.

In Fig.(2a), each column corresponds to one transition matrix and each row corresponds to one state’s transition distribution. The first column denotes the ground-truth transition distribution, and the last two columns are transition matrices learned from the traditional HMM and diversified HMM respectively. Comparing to the result of traditional HMM (the middle column), the diversity-encouraging prior takes effect as illustrated in the third column: The transition distributions of different states are mutually distinct. Until now, it is still hard to decide which model infers the hidden states better, since different hidden structures inferred by different models may inherit different meanings.

In Fig.(2b), each row illustrates the other two kinds of parameters () - the first column denotes the state initial distribution while the other two columns denote the means and covariances of the five emission Gaussian distributions. The first row shows the ground-truth. The second row shows the MLE result learned from the traditional HMM. The last row shows the MAP result learned from the proposed dHMM. From the figure, the traditional HMM identifies only two groups of patterns: The states 1, 2, 3, 4 are in the first group, which have quite similar emission distributions in terms of their Gaussian means and variances. The state 5 is in the second group, which has very different Gaussian parameters comparing to the others. In this case, hidden states 1, 2, 3, 4 are difficult to be differentiated, which can lead to ambiguous labeling results. In contrast, our proposed diversified HMM shows superiority in terms of its higher discriminative ability for differentiating the hidden states involved. This is revealed in the third row of Fig.(2b), which shows that different hidden states induce distinct emission components. From the results obtained, the claim that the proposed diversity-encouraging prior over rows of transition matrix indirectly increases the discrimination of hidden states is justified.

We also compared the proposed diversified HMM with traditional HMM in terms of sequential labeling accuracy. Given the learned parameters, the most likely sequential labels are inferred for each observed sequence by Viterbi algorithm. Both the inferred state frequencies and labeling accuracies of sequential labeling for different models are summarized in Table (1). The histograms of the sequential labels (i.e., the frequencies of hidden states in the given dataset) inferred from ground-truth parameters, parameters learned from traditional HMM and parameters learned from diversified HMM are shown in the second row of Table (1). The ground-truth histogram distributes almost equally amongst the five states, while the statistic of hidden states inferred from traditional HMM tends to be highly biased which favors one dominant state. This problem is somewhat rectified by the proposed diversified HMM. Shown in the third column, the histogram of hidden states inferred from dHMM shows more resemblance to the ground-truth than the histogram inferred from traditional HMM.

To compute the 1-to-1 accuracy of sequential labeling, labels inferred by parameters learned from both the traditional HMM and diversified HMM are aligned to the ground-truth by Hungarian algorithm. As shown in the third row of Table (1), the proposed dHMM outperforms traditional HMM by a large margin.

4.1.2 More explanations on dHMM’s superiority over HMM

Further, in this subsection, the superiority of our proposed dHMM over traditional HMM is statistically illustrated especially in the case where the emission distributions are almost flatten. Under this situation, the hidden states are ambiguous and less discriminative, which leads to that traditional HMM identifies less hidden states than the ground-truth, i.e. the learned transition matrix contains more similar rows than the ground-truth transition matrix does. Not only intuitively but also experimentally, our proposed dHMM mitigates this issue to some extent.

All ground-truth parameters of HMM take the same setting as above except the variances of the emission distributions . The variances of the Gaussian distributions are gradually enlarged to ‘flatten’ the emission distributions. In here, we used a sequence of variance parameters , where . For each t, we generated the experimental sequences by the same method described above. The experimental results are averaged over 10 runs with independent initializations.

We apply averaged Bhattacharyya distance over all pairwise of rows of transition matrix as diversity measure. Higher Bhattacharyya distance means more diversity of rows of transition matrix. The quantized diversities of rows of transition matrix is shown in Fig. (3). The green line shows the diversity of the ground-truth transition matrix whose value is 0.531. The red curve below the green line and the blue curve above the green line show the diversities of the transition matrices learned by traditional HMM and proposed dHMM respectively. The effectiveness of diversity-encouraging prior is obvious: The dHMM consistently outperforms the HMM, no matter what the parameters of variances are set.

Higher diversity of transition matrix implies more inferred hidden states. One example of the histogram of the inferred states is shown in Fig.(4), in which the number of states is identified by omitting the labels whose frequencies are below certain threshold . In our case, we set the threshold as indicated by the black line. We show that the dHMM (colored as Green) identifies all five states, while the HMM (colored as Red) identifies only two states, since the frequencies of states 2, 3, 4 are below the threshold .

We summarize the statistical results in Fig.(5). From the left part of the curve, the emission Gaussian distributions start with low variance. The hidden states can be easily identified and the dHMM performs on par with HMM. However, along with the increasing of the variance, the

0 5 10 15 20 25 30 35 40 45 50 0.2

Fig. 3: Diversities of transition matrix of ground-truth, dHMM-learned and HMM-learned with regard to the parameter of variance of the Gaussian emission distributions

Fig. 4: Histograms of hidden states inferred from parameters of ground-truth, dHMM-learned and HMM-learned

emission Gaussian distributions are becoming increasingly ‘flattened’, which make the states severely ambiguous and hard to identify. Shown in the right half part of the curve, the advantage of our dHMM is becoming more obvious as it identifies more hidden states than the traditional HMM does.

4.2 Real-world experiments

In this section, our proposed diversified HMM (dHMM) is applied to solve real-world sequential labeling problems: PoS tagging under unsupervised setting and OCR under supervised setting.

4.2.1 PoS tagging

Part-of-Speech tagging (PoS tagging) [36] has been used in the linguistics community for a long time. The task is to

TABLE 1: Comparison of state frequencies and accuracies between dHMM and HMM

0 10 20 30 40 50 1

Fig. 5: Number of hidden states inferred by model parameters of dHMM-learned and HMM-learned with regard to the variance of Gaussian emission distributions

automatically assign contextually appropriate grammatical descriptors to words in texts. In fact, PoS tagging usually produces low level semantic information, which can serve as a precursor towards more abstract levels of analysis, e.g., text indexing and retrieval, as nouns and adjectives are better candidates for index terms than adverbs or pronouns are.

The Penn Treebank Wall Street Journal (WSJ) corpus ( [35]) is one of the most widely used datasets for evaluating performance of the statistical language models. The training corpus, tagged by gold standard PoS tags, is utilized to evaluate proposed diversified HMM (dHMM) for Part-of-Speech (PoS) tagging task under unsupervised setting. The vocabulary size of the corpus is around 10K. In Table (2), the PoS tags appeared in the WSJ corpus are listed. Detailed definitions of the abbreviation of tags are annotated in [35]. The tags are preprocessed to reduce the hidden state size from 46 to 15 by combining similar tags. The idx column shows the indexes of the reduced tag set. The PoS column shows the semantic title of the tags. The frequency column shows the frequencies of all tags. From this statistics, 25% tags account for nearly 85% words. All of 3828 sentences

TABLE 2: Summary of PoS tags of WJS corpus

Fig. 6: Sentence example with PoS tags

are used in our experiment, and the sequential length is between . An example sentence with true sequential PoS tags is illustrated in Fig. 6, where the true tags lay behind the corresponding words. Naturally, the transition distribution for different tags are different. Take tags /NNP and /VB as an example, the /NNP has higher probability to be followed or following the same /NNP tag. By contrast, /VB is usually followed by /DT or /IN, and follows /MD, /TO or /RB. This discriminative prior information considered by our model is significantly helpful for sequential labeling, which is verified and demonstrated in detail below.

The number of hidden states is set as k = 15 as enumerated in Table (2). Afterwards, the initial state distribution is a 15-dimension vector and the transition distribution is

0 0.1 1 10 100 1000 0.32

Fig. 7: Effectiveness of for PoS tagging

parameterized by A which is a matrix. The words in the vocabulary are treated as observations and the emission distributions are parameterized by B which is a matrix, where V is the size of vocabulary. The , each row of A and each row of B are randomly initialized by samples from the Dirichlet distribution. We apply the 1-to-1 accuracy measure to quantize our experimental results. Similar with the simulated experiment, the Hungarian algorithm is utilized to map the inferred labels to the ground-truth ones.

First, we test the effectiveness of diversity-encouraging prior over rows of transition matrix in terms of prior weights, namely, values. The labeling accuracies with regard to s are plotted in Fig.(7). The setting corresponds to the setting of traditional HMM. Traditional HMM gets an accuracy of 0.4475, while our proposed dHMM achieves the best accuracy of 0.4688 with . Larger weight () will overemphasize the diversity-encouraging prior over rows of transition matrix and will lead to decreasing the sequential labeling accuracy, as shown with a sharp drop when increases to 1000 in Fig.(7).

Then, we qualitatively demonstrate the effectiveness of our diversity prior by comparing the transition parameters matrix learned from dHMM () to the parameters learned from the baseline - traditional HMM. The diversity measurements (Bhattacharyya distance) between tag 1 and the other tags are shown in Fig. (8). The proposed dHMM identifies that tag 1 (NOUN) is most different from tag 11 (Interjection), while HMM identifies that tag 1 (NOUN) is most different from tag 5 (MODAL). Since the frequency of tag 11 (Interjection) is only 3, intuitively, the transition distribution of this tag is quite different from the transition distributions of other tags. The same situation is applied to tag 9 (Foreign word, whose frequency is 4). From Fig. (8), the proposed dHMM identifies more accuracy result, which indicates that the transition distributions of both tags 11 and 9 are most different from tag 1 (NOUN), than the result learned from the traditional HMM.

Finally, we show how the diversity-encouraging prior indirectly rectifies the emission distributions learned from traditional HMM to fit the dataset better, as illustrated in Fig. (9). Here, we choose to explain the behavior of the proposed dHMM. Three curves are plotted. The statistics of the ‘ground-truth’ curve is obtained through the inferred

Fig. 8: Transition diversity comparison between dHMM and HMM for tag ‘1’ and all other tags

Fig. 9: Histogram comparisons among ground-truth, HMM and dHMM

hidden labels by the true parameters. 1 From the ‘ground-truth’ curve, small portion of tags explain majority of the words, which is pointed out as skewed long-tail distribution [23]. ‘dHMM’ curve learned by our proposed diversified HMM reflects this phenomenon especially for the less frequent 10 tags and shows promising result for unsupervised sequential labeling task. To some extent, the diversified transition prior latently adjusts the flatten distributions for the less frequent 10 tags obtained from traditional HMM to a better trend approaching the true distributions.

4.2.2 OCR

Optical character recognition (OCR) is a task of converting images of typewritten or printed texts (which are the common forms of scanned data, e.g., passport, receipts) into computer-readable texts. It can be applied to many real-world applications, such as efficient data entry for business documents, automatic number plate recognition. To achieve this aim, one has to perform both character segmentation

TABLE 3: Examples of OCR dataset

and recognition tasks. In this work, we assume every character has been segmented out and stored in its own image patch, so that we focus on the recognition task, which is referred as sequential labeling here.

We apply the OCR dataset processed by [48]. They select clean subset from the handwritten words collected by Rob Kassel at the MIT Spoken Language Systems Group. By removing the first capitalized letters, Ben et al. rasterized and normalized images of the rest lowercase letters into images. There are in total 6877 words containing letters. Three word examples are listed in Table (3). The first two columns show two different handwritten patterns from different persons for the word listed in the third column. Apparently, each letter has different probability being transfered to other letters. As highlighted in Table (3), letter ‘m’ has high probability to be followed by ‘m’, ‘a’ or ‘b’, while letter ‘n’ will prefer to be transferred to ‘d’, ‘g’ or ‘i’. Intuitively, we suppose our proposed model will take effect on this dataset and we verify it in the following.

We apply the true number of lowercase English letters as the size of the hidden state space, namely k = 26. Accordingly, the initial state distribution is a 26-d vector , and the transition matrix A is a matrix. Each observed letter image is reshaped into a binary vector. For the emission distributions, Naive Bayes assumption is applied and each dimension of binary vector is independently modeled by Bernoulli distribution, parameterized by , measuring the probability of that the current pixel value is equal to 1. Finally, emission distributions B is modeled by parameters. In supervised setting, the parameters are learned by MLE from the training set. All of our experiments are run with 10-fold cross validation.

Like PoS experiment, we first test the effectiveness of our proposed diversity-encouraging prior with a range of s. The test accuracies are shown in Fig. (10). The results are given by the averages across 10 runs. Another parameter , which tries to drag A to , has been chosen through the 1-to-1 accuracy criterion and is fixed as corresponds to the traditional HMM and it gets an accuracy of 0.7102 while our proposed dHMM obtains an accuracy of 0.7203 with . That an increasing trend is gained demonstrates the effectiveness of dHMM though larger will decrease the performance.

Our proposed model is compared to three baseline algorithms for supervised sequential labeling: Naive Bayes, traditional HMM and Optimized HMM [26]. The average accuracies with standard deviations are shown in Fig. (11). Naive Bayes algorithm ignores the relationship between neighbor letters and achieves the lowest accuracy of 62.7%

0 0.1 1 10 100 1000 0.708

Fig. 10: Effectiveness of for OCR

Fig. 11: Test accuracies of different classifiers

with a standard deviation of 1.1%. Incorporating one-order chain structure of letters, HMM achieves a 70.6% accuracy rate with a standard deviation of 1.3%. With other tricks, the optimized HMM obtains limited improvement. By contrast, by adding diversity-encouraging prior over the rows of transition matrix of traditional HMM, our proposed dHMM achieves an accuracy of 72.06 with a standard deviation of 2.2% which apparently gains a significant margin.

Finally, a qualitative demonstration of the diversity is shown in Fig. (12). The transition matrix A is trained from the setting of . Fig.(12a) (Fig. (12b)) shows the diversity measurements (Bhattacharyya distance) between transition distribution of character ‘x’ (‘y’) and transition distribution of the other 25 letters. From the curves, the total trends almost are the same everywhere between traditional HMM and our proposed dHMM, except that dHMM heights the pairwise diversities between transition distributions of (‘x’,‘g’), (‘x’,‘j’) and (‘y’,‘f’), which we conclude contributes to the improvement of test accuracies in some extent .

5 CONCLUSION

Based on the methodology of traditional HMM, a diversified HMM (dHMM) for sequential labeling was proposed in this paper: Instead of explicitly constraining the parameters

Fig. 12: Transition diversity comparison between dHMM and HMM

associated with the observations, we placed a diversity-encouraging prior over the parameters of transition distribution, modeled by Determinantal Point Processes (DPP), which is an essential part of traditional HMM. To facilitate this variation of HMM, a new Maximum A Posterior (MAP) scheme was proposed under both unsupervised setting and supervised setting. For unsupervised setting, maximum a posterior with marginal likelihood was solved based on the EM framework for the traditional HMM, but with a modified M-step. For supervised setting, maximum a posterior with joint likelihood was trained directly through the gradient descend method. We verified the effectiveness of our proposed dHMM through both the simulated and the real-world datasets (e.g., unsupervised PoS tagging and supervised OCR).

Our future work will involve with the development of a non-parametric extension to dHMM, which simultaneously learns the number of hidden states, as well as all HMM parameters. We will carry out a theoretical study into the effectiveness of the number of states as well as diversity-encouraging prior over rows of transition matrix under our setting with regard to labeling accuracy.

ACKNOWLEDGMENTS

The work was supported in part by the Australian Research Council Projects grants DP-120103730, DP-140102164, FT-130101457, and LP-140100569.

REFERENCES

[1] R. H. Affandi, E. B. Fox, R. P. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[2] R. H. Affandi, E. B. Fox, R. P. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In arXiv:1402.4862, 2014.

[3] R. H. Affandi, E. B. Fox, and B. Taskar. Approximate inference in continuous determinantal point processes. In Proceedings of Conference on Neural Information Processing Systems, pages 1430– 1438, 2013.

[4] R. H. Affandi, E. B. Fox, and B. Taskar. Approximation inference in continuous determinantal point processes. In Proceedings of Advances in Neural Information Processing Systems, 2013.

[5] R. H. Affandi, A. Kulesza, and E. B. Fox. Markov determinantal point processes. In Proceedings of Conference on Uncertainty in Artificial Intelligence, 2012.

[6] S. Amer-Yahia, F. Bonchi, C. Castillo, E. Feuerstein, I. MendezDiaz, and P. Zabala. Composite retrieval of diverse and complementary bundles. IEEE Transactions on Knowledge and Data Engineering, 2014.

[7] K. Asai, S. Hayamizu, and K. Handa. Prediction of protein secondary structure by the hidden markov model. 1992.

[8] M. Bicego, M. Cristani, and V. Murino. Sparseness achievement in hidden markov models. In Proceedings of 14th International Conference on Image Analysis and Processing, pages 67–72, Sept 2007.

[9] C. M. Bishop et al. Pattern recognition and machine learning, volume 4. springer New York, 2006.

[10] M. Fang, J. Yin, and D. Tao. Active learning for crowdsourcing using knowledge transfer. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[11] Y. Fang, R. Wang, B. Dai, and X. Wu. Graph-based learning via auto-grouped sparse regularization and kernelized extension. IEEE Transactions on Knowledge and Data Engineering, 27, 2015.

[12] S. Feng and R. Manmatha. A hierarchical, hmm-based automatic evaluation of ocr accuracy for a digital library of books. In Proceedings of the 6th ACM/IEEE-CS joint Conference on Digital Libraries, pages 109–118, 2006.

[13] M. A. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis & Machine Intelligence, 24:381–396, 2000.

[14] J. V. Gael, A. Vlachos, and Z. Ghahramani. The infinite hmm for unsupervised pos tagging. In Proceedings of Conference on Empirical Methods in Natural Language Processing, pages 678–687, 2009.

[15] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectationmaximization for learning determinantal point processes. In Proceedings of Advances in Neural Information Processing Systems, pages 3149–3157, 2014.

[16] J. Gillenwater, A. Kulesza, and B. Taskar. Discovering diverse and salient threads in document collections. In Proceedings of Conference on Empirical Methods in Natural Language Processing, pages 710–720, 2012.

[17] J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal map inference for determinantal point processes. In Proceedings of Advances in Neural Information Processing Systems, pages 2735–2743, 2012.

[18] J. Gillenwater, A. Kulesza, and B. Taskar. Near-optimal map inference for determinantal point processes. In Proceedings of Conference on Neural Information Processing Systems, pages 2735– 2743, 2012.

[19] S. Goldwater and T. L. Griffiths. A fully bayesian approach to unsupervised part-of-speech tagging. Association of Computational Linguistics, pages 744–751, June 2007.

[20] J. B. Hough, M. Krishnapur, Y. Peres, and B. Virag. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.

[21] J. Hu, M. K. Brown, and W. Turin. Hmm based on-line handwrit- ing recognition. 1992.

[22] T. Jebara, R. Kondor, and A. Howard. Probability product kernels. Journal of Machine Learning Research, 5:819–844, July 2004.

[23] M. Johnson. Why doesn’t em find good hmm pos-taggers? In Proceedings of Joint Conference on Empirical Methods in Natural Language Proceeding and Computational Natural Language Learning, pages 296–305, 2007.

[24] B. Kang. Fast determinantal point processes sampling with application to clustering. In Proceedings of Conference on Neural Information Processing Systems, pages 2319–2327, 2013.

[25] M. Kojima and F. Komaki. Determinantal point process priors for bayesian variable selection in linear regression. In ArXiv:1406.2100, 2014.

[26] E. Krevat and E. Cuzzillo. Improving off-line handwritten charac- ter recognition with hidden markov models. IEEE Transactions on Pattern Analysis & Machine Intelligence, 33, 2006.

[27] A. Krogh, B. Larsson, G. von Heijne, and E. L. L. Sonnham- mer. Predicting transmembrane protein topology with a hidden markov model: Application to complete genomes. 2012.

[28] A. Kulesza and B. Taskar. Structured determinantal point processes. In Proceedings of Conference on Neural Information Processing Systems, pages 1171–1179, 2010.

[29] A. Kulesza and B. Taskar. k-dpps: Fixed-size determinantal point processes. In Proceedings of International Conference on Machine Learning, pages 1193–1200, 2011.

[30] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2– 3), 2012.

[31] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5, 2012.

[32] Z. Li, J. Liu, Y. Yang, X. Zhou, and H. Lu. Clustering-guided sparse structural learning for unsupervised feature selection. IEEE Transactions on Knowledge and Data Engineering, 26, 2014.

[33] X. Liu and T. Chen. Video-based face recognition using adaptive hidden markov models. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, pages I–340, 2003.

[34] M. Long, J. Wang, G. Ding, D. Shen, and Q. Yang. Transfer learning with graph co-regularization. IEEE Transactions on Knowledge and Data Engineering, 26, 2014.

[35] M. Marcus, B. Santorini, and M. Marcinkiewicz. Building a large annotated corpus of english: the penn treebank. Computational Linguistics, 1993.

[36] R. Mitkov. The oxford handbook of computational linguistics. Oxford University Press, 2003.

[37] S. Morwal, N. Jahan, and D. Chopra. Named entity recognition using hidden markov model (hmm). pages 15–23, 2012.

[38] H. Murveit and R. Moore. Integrating natural language constraints into hmm-based speech recognition. In Proceedings of International Conference on Acoustics, Speech, and Signal Processing, pages 573– 576, 1990.

[39] C. A. Napoleon and F. Lavancier. About repulsiveness of determi- nantal point processes. In ArXiv:1406.2796, 2014.

[40] F. Niu and M. Abdel-Mottaleb. Hmm-based segmentation and recognition of human activities from video sequences. In Proceedings of IEEE International Conference on Multimedia and Expo, pages 804–807, 2005.

[41] G.-J. Qi, C. C. Aggarwal, and T. S. Huang. Online community detection in social sensing. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 617–626. ACM, 2013.

[42] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77, pages 257–286, 1989.

[43] F. Sha and L. K. Saul. Large margin hidden markov models for automatic speech recognition. In Proceedings of Conference on Neural Information Processing Systems, pages 1249–1256, 2006.

[44] A. Shah and Z. Ghahramani. Determinantal clustering process - a nonparametric bayesian approach to kernel based semi-supervised clustering. In Proceedings of Conference on Uncertainty in Artificial Intelligence, page 566, 2013.

[45] J. Snoek and R. P. Adams. a determinantal point process latent variable model for inhibition in neural spiking data. In Proceedings of Advances in Neural Information Processing Systems, pages 1932– 1940, 2013.

[46] J. Snoek and R. P. Adams. A determinantal point process latent variable model for inhibition in neural spiking data. In Proceedings of Conference on Neural Information Processing Systems, pages 1932– 1940, 2013.

[47] L. Tao, E. Elhamifar, S. Khudanpur, G. D. Hager, and R. Vidal. Sparse hidden markov models for surgical gesture classification and skill evaluation. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering, pages 167– 177, 2012.

[48] B. Taskar, C. Guestrin, and D. Koller. Max-margin markov net- works. In Proceedings of Conference on Neural Information Processing Systems, page 25, 2003.

[49] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical dirichlet processes. Journal of the american statistical association, 101(476), 2006.

[50] Q. I. Wang and D. Schuurmans. Improved estimation for unsuper- vised part-of-speech tagging. In Proceedings of International Conference on Natural Language Processing and Knowledge Engineering, pages 219–224, 2005.

[51] W. Wang and M. A. Carreira-Perpin´an. Projection onto the proba- bility simplex: An efficient algorithm with a simple proof, and an application. In arXiv:1309.1541, 2013.

[52] S. Yang, Z. Yi, M. Ye, and X. He. Convergence analysis of graph regularized non-negative matrix factorization. IEEE Transactions on Knowledge and Data Engineering, 26, 2014.

[53] J. Y. Zou and R. P. Adams. Priors for diversity in generative latent variable models. In Proceedings of Advances in Neural Information Processing Systems, pages 2996–3004, 2012.

Maoying Qiao received the B.Eng. degree in Information Science and Engineering from Central South University, Changsha, China, in 2009, and the M.Eng. degree in Computer Science from Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China, in 2012. She is currently pursuing the Ph.D. degree with the University of Technology at Sydney (UTS), Sydney, NSW, Australia. Her current research topics are pattern recognition and probabilistic graphical modeling.

Wei Bian (M’14) received the B.Eng. degree in electronic engineering and the B.Sc. degree in applied mathematics in 2005, the M.Eng. degree in electronic engineering in 2007, all from the Harbin institute of Technology, harbin, China, and the PhD degree in Computer Science in 2012 from the University of Technology, Sydney, Australia. His research interests are pattern recognition and machine learning.

Richard Yi Da Xu received the B.Eng. degree in computer engineering from the University of New South Wales, Sydney, NSW, Australia, in 2001, and the Ph.D. degree in computer sciences from the University of Technology at Sydney (UTS), Sydney, NSW, Australia, in 2006. He is currently a Senior Lecturer with the School of Computing and Communications, UTS. His current research interests include machine learning, computer vision, and statistical data mining.

Dacheng Tao (F’15) is Professor of Computer Science with the Centre for Quantum Computation & Intelligent Systems, and the Faculty of Engineering and Information Technology in the University of Technology, Sydney. He mainly applies statistics and mathematics to data analytics and his research interests spread across computer vision, data science, image processing, machine learning, neural networks and video surveillance. His research results have expounded in one monograph and 100+ publications at prestigious journals and prominent conferences, such as IEEE T-PAMI, TNNLS, T-IP, JMLR, IJCV, NIPS, ICML, CVPR, ICCV, ECCV, AISTATS, ICDM; and ACM SIGKDD, with several best paper awards, such as the best theory/algorithm paper runner up award in IEEE ICDM’07, the best student paper award in IEEE ICDM’13, and the 2014 ICDM 10 Year Highest-Impact Paper Award.

Designed for Accessibility and to further Open Science