Various work has been done on clustering high-dimensional data; including data where the number of variables p is very large relative to the number of observations n. Some methods for clustering such data are based on the mixtures of factor analyzers model (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000b), which assumes the variability in p variables can be explained by q < p latent variables for each of the G components. Baek et al. (2010) build on work done by Yoshida et al. (2004, 2006) to introduce the closely related mixtures of common factor analyzers model. which further reduces the number of model parameters and can be useful when the mixture of factor analyzers model is not sufficiently parsimonious.
Extensive work has been done on mixture model-based clustering, where a cluster can be viewed as a component within a mixture model (see McNicholas, 2016, for details and discussion). A mixture model-based approach has the advantage of ease of interpretation as well as flexibility, in that mixture models can be used for clustering various types of data and have applications in many different areas. For example, Melnykov (2016) uses a model-based approach for clustering internet users based on the sequences of web-pages they visit, and Cheam et al. (2017) use a model-based approach for clustering spatiotemporal air quality data. Various work has been done on clustering longitudinal data. For example, McNicholas and Murphy (2010) introduce a family of mixture models that utilizes a covariance matrix decomposition that takes into account the relationship between measurements taken at different time points, and Han et al. (2017) introduce a Bayesian semi-parametric method for clustering data that identifies the temporal pattern common to all subjects as well as temporal patterns unique to individual subjects. However, there is a dearth of work specifically for clustering high-dimensional longitudinal data, i.e., data that contain measurements taken at many time points. This paper introduces a method for clustering longitudinal data using a latent Gaussian mixture model that builds on the approaches introduced by Baek et al. (2010) and McNicholas and Murphy (2010). In particular, it allows the variability in measurements taken at a large number of time points, p, to be explained using a smaller number of time points, q.
2.1 Background
The density of a p-dimensional random variable X that arises from a finite mixture model with G
components is of the form
where th mixing proportion with
th component density, and
) is the vector of parameters. The most commonly used
mixture model is the Gaussian mixture model and the density of this model is of the form
where ) is the density of the multivariate Gaussian distribution with mean vector
mean vector
covariance matrix
Pourahmadi (1999, 2000) uses a modified Cholesky decomposition on the covariance matrix a p-dimensional random variable, which can be used to model longitudinal data. This decomposition
is given by
where T is a unique unit lower triangular matrix, the lower triangular elements of which are interpreted as generalized autoregressive parameters, and D is a unique
diagonal matrix, the (strictly positive) diagonal elements of which are interpreted as innovation variances (Pourahmadi, 1999). This allows the value of a measurement taken at time
, to be predicted using the measurements taken at previous time points, denoted
Specifically, the linear
least-squares predictor of is given by
where the are the lower triangular elements of
are the diagonal elements of D, and
ϵ, . . . , p.
To cluster longitudinal data, McNicholas and Murphy (2010) use a Gaussian mixture model
with each component covariance matrix decomposed according to (3), i.e.,
for g = 1, . . . , G. Equivalently, each component precision matrix is given by
Imposing constraints on the matrices results in a family of eight mixture models, ranging from
2 + 1 covariance parameters in the most constrained to
in the least constrained.
Suppose p-dimensional are observed. The mixture of factor analyzers (MFA) model (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000a) assumes that
can be modelled as
with probability . Note that
1 mean vector
a
matrix of factor loadings,
1 latent factors
are independent
the
are independent
diagonal matrix, and the
independent of each other. It follows that the density of is given by
The number of parameters in the MFA model can be further reduced by imposing certain constraints on the mean vector and covariance matrix; this leads to the mixtures of common factor analyzers
(MCFA) model introduced by Baek et al. (2010), which assumes can be modelled as
with probability matrix of factor loadings, the latent factors
are independent
are independent
are independent of each other. Note that
-dimensional vector,
symmetric matrix, and
diagonal matrix. It follows that the density of
is given by
2.2 The Model
The proposed model utilizes the modified Cholesky decomposition on the latent covariance matrix in the MCFA model; that is, unit lower triangular matrix
and diagonal matrix. It follows that the density given in (9) can be written
Using results from Baek et al. (2010), the total number of parameters in the model is
Similarly to the family of mixture models for longitudinal data introduced by McNicholas and Murphy (2010), the value in (11) can be reduced further by imposing constraints on the matrices. Doing so leads to a family of eight mixture models; details of which, including the number of covariance numbers required to be estimated for each, are given in Table 1. Note that, unlike the aforementioned family of mixture models for longitudinal data, these constraints directly impact the latent space.
Table 1: The nomenclature and covariance structure for each member in the family of models, together with the number of free covariance parameters in
2.3 Model Fitting
The models are fitted using an EM algorithm (Dempster et al., 1977; Baek et al., 2010), where the are taken to be the observed data and the component membership labels
and latent factors
are taken to be the missing data, where
= 1 if the ith observation belongs to the
gth component and = 0 otherwise. The complete-data log-likelihood is given by
and its expected value is given by
where the ˆare the expected values of the component membership labels, given by
In the E-step of the algorithm, the ˆare updated according to (14). In the M-step, the model parameter estimates, obtained by maximizing Q, are updated. Maximizing Q with respect to
yields the update
where . Differentiating Q with respect to
, respectively, (and using results from L¨utkepohl, 1996) gives the following score functions, derivations of which are included in the Appendix. Only the VVA model and its derivations are included here, but derivations of the other models in Table 1 follow similarly.
where
and
Note that the joint distribution of given membership in the gth component is given
from which it follows that
and
Using (16) and (17), the parameter estimates are obtained after setting the above score functions to the appropriate zero vector or matrix; further details are given in the Appendix. First, solving
To find ˆ, first denote the lower triangular elements of
1, so that
Then, setting the lower triangular element of ) equal to 0 and solving for the
for denotes the (
. Thus, the lower triangular elements of ˆ
are given by the ˆ
The pseudocode for the EM algorithm is then given by:
initialize ˆ
initialize ˆ
while not converged
compute ˆ
update ˆ
check convergence criterion
end while
Various possible convergence criteria can be used, such as those based on Aitken’s acceleration (Aitken, 1926; B¨ohning et al., 1994; McNicholas et al., 2010). For the simulations done in the following section, using a lack of progress in the (observed) log-likelihood yields good results. Thus, we consider the EM to have converged when there is a lack of progress in the log-likelihood,
i.e., when
where is the log-likelihood value at iteration k.
2.4 Model Selection
After fitting the model for different values of G and q, the Bayesian information criterion (BIC;
Schwarz, 1978) is used to select the “best” model amongst those fitted. It is given by
where is the maximum likelihood estimate of
) is the maximized log-likelihood,
number of free parameters, and n is the number of observations. Because
) surely increases as the number of parameters increases, the second term in (19) provides a penalty for the number of parameters. Note that, assuming equal prior probabilities, the difference in BIC values between two models approximates the Bayes factor for a test concerning those models (Dasgupta and Raftery, 1998). The fitted model with the largest BIC value is selected as the best model.
3.1 Performance Assessment
The Rand index (RI; Rand, 1971) can be used to assess classification performance. It is given by
where the number of pairwise agreements is equal to the number of pairs of observations that belong to the same class and are grouped as such, plus the number of pairs that belong to separate classes and are grouped as such; the total number of pairs is. The adjusted Rand index (ARI; Hubert and Arabie, 1985) corrects for chance agreement between pairs that tends to inflate the value in (20). The ARI has an expected value of 0 under random classification and a value of 1 under perfect classification. Clustering can be performed on data where group memberships are known by treating the data as unlabelled, and the resulting ARI value can then be used to assess the clustering performance of the model.
3.2 Simulated Data Set 1
To illustrate the performance of the proposed model, the EM algorithm outlined in Section 2.3 is run on simulated data and the resulting clustering performance is assessed. Data are generated from G = 4 distributions, where = 150. Each observation consists of p = 11 observed time points which are generated from 3-dimensional latent mean vectors, given by, respectively,
. The matrix of factor loadings
is given by
and the resulting data are shown in Figure 1. The EM algorithm is then run for G = 1, . . . , 6
Figure 1: Simulated Dataset 1. The component memberships are indicated by colour.
components and q = 2, 3, 4 latent time points. For each G and q, multiple starting values, using both random starting values and k-means clustering, are used to initialize the
The model chosen by the BIC is indeed the “correct” model, in that G = 4 and q = 3. This model classifies all the observations correctly, as seen in Table 2, and so the associated ARI is 1. The fitted model consists of 74 parameters in total. When comparing this to the standard (unconstrained) Gaussian mixture model that does not utilize any latent factors, we find that, by
setting G = 4 and p = 11 in
clustering would require 311 parameters.
3.3 Simulated Data Set 2
For the second simulation study, data are again generated from G = 4 distributions, where = 150. This time, each observation consists of p = 30 observed time points which are generated from 7-dimensional latent mean vectors, given by, respectively,
; the matrix of
Table 2: True component memberships cross-tabulated against predicted group memberships for the first simulated data set.
factor loadings is a 30 7 matrix. The data are shown in Figure 2. Note that this clustering problem is somewhat more challenging than the previous one.
Figure 2: Simulated Dataset 2. The true component memberships are indicated by colours and the estimated group memberships are indicated by line style: alternating dashes, small dots, and solid lines represent Groups 1, 2, and 3, respectively.
The EM algorithm is run for G = 1, . . . , 6 components and q = 5, . . . , 9 latent time points, again using multiple runs of both random starting values and k-means clustering to initialize the ˆvalues. The resulting BIC values are shown in Figure 3: two, three, or perhaps four groups, and seven time latent time points, appear to give the best fit for the data. The best model has G = 3 components and q = 7 latent time points, and the resulting clustering performance of this model is summarized in Table 3. The associated ARI is 0.69, indicating good clustering performance despite the “correct” number of components not being chosen. More specifically, we see from Table 3 that observations from components 1 and 4 are almost perfectly classified, while observations from components 2 and 3 are essentially classified into one group.
Figure 3: BIC values versus the number of groups, where the numbers 5, . . . , 9 indicate the number of latent time points.
Table 3: True group memberships (A–D) cross-tabulated against predicted group memberships (1–3) for the second simulated data set.
A total of 298 parameters are required for our fitted model. Without utilizing any latent factors,
the (unconstrained) component covariance matrices alone would require parameters. Thus, the model was able to achieve good clustering performance on this longitudinal data set using a significantly reduced number of parameters.
3.4 Yeast Sporulation Data Set
The final data set consists of 6,118 expressions of yeast genes measured at p = 7 time points, analyzed previously by various researchers to find groups of genes exhibiting similar expression patterns (Mitchell, 1994; Chu et al., 1998; Wakefield et al., 2003; McNicholas and Murphy, 2010).
Figure 4: BIC values versus the number of groups, where the numbers 3, 4, and 5 indicate the number of latent time points.
To test the performance of the proposed model, we follow the analysis of McNicholas and Murphy (2010) and consider all seven time points without the deletion of any genes, and take the negative logarithm, base 2, of each observation prior to analysis. The observations at each time point are also standardized to have mean 0 and variance 1. The model is then fitted for G = 1, . . . , 20 components and q = 3, 4, 5 latent time points. The are again initialized using both random starting values and k-means clustering. The resulting BIC values for each model are shown in
Figure 5: BIC values versus the number of groups for q = 5 latent time points.
Figure 4. The best model, as seen more clearly in Figure 5, has G = 10 groups and q = 5 latent time points. The resulting groups are pictured in Figure 6, and are compared to the groups found by McNicholas and Murphy (2010) in Table 4.
From Table 4, it seems clear that the model introduced herein found different groups of gene expressions. The fitted model requires 226 parameters, while the model that does not utilize any latent factors would require 359 parameters.
A mixture model for clustering high-dimensional longitudinal data was introduced, which has the ability to account for the longitudinal nature of the data being clustered while also providing signifi-cant data reduction. This is achieved using an extension of the mixture of common factor analyzers model, whereby the variability in measurements taken at many time points can be explained using a smaller number of (latent) time points. A modified Cholesky decomposition is used on the latent covariance matrices to account for the relationship between measurements in the latent space. Constraints can be placed on these latent covariance matrices to further reduce the number of pa-
Figure 6: Yeast gene expressions (after taking the negative logarithm, base 2, and standardizing) coloured by estimated group memberships, with estimated group mean trajectories (given by ˆfor g = 1, . . . , 10).
rameters; however, the mixture of common factor analyzers approach used in this model provides significant data reduction even without doing so.
The model was fit to two simulated data sets and one real data set, overall demonstrating good clustering performance using a significantly reduced number of parameters. Use of the model on the first simulated data set resulted in a perfect clustering result, and use of the model on the second simulated data set, which consisted of 30 time points, yielded good clustering performance using only seven latent time points. Use of the model on the yeast sporulation data set allowed clustering of gene expressions to be achieved using a reduced number of parameters.
Further work could include applications to (real) longitudinal data sets consisting of a large number of time points. As well, note that the model outlined here is only useful for clustering univariate longitudinal data—that is, data that contain only one measurement taken at each time point. Anderlucci and Viroli (2015) use a mixture of matrix variate normal distributions to cluster multivariate longitudinal data; thus, future work could involve extending the model introduced here
Table 4: Predicted group memberships (A–M) found by McNicholas and Murphy (2010) cross-tabulated against predicted group memberships found by the new model (1–10), for the yeast data set.
to account for longitudinal data that contain multiple measurements taken at each time point. Further, this model may be extended to include non-Gaussian distributions; for example, Gallaugher and McNicholas (2017) detail clustering using mixtures of skewed matrix variate distributions.
Aitken, A. C. (1926). A series formula for the roots of algebraic and transcendental equations. Proceedings of the Royal Society of Edinburgh 45, 14–22.
Anderlucci, L. and C. Viroli (2015). Covariance pattern mixture models for multivariate longitudinal data. The Annals of Applied Statistics 9(2), 777–800.
Baek, J., G. J. McLachlan, and L. K. Flack (2010). Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 1298–1309.
B¨ohning, D., E. Dietz, R. Schaub, P. Schlattmann, and B. Lindsay (1994). The distribution of the likelihood
ratio for mixtures of densities from the one-parameter exponential family. Annals of the Institute of Statistical Mathematics 46, 373–388.
Cheam, A. S. M., M. Marbac, and P. D. McNicholas (2017). Model-based clustering for spatiotemporal data on air quality monitoring. Environmetrics 93, 192–206.
Chu, S., J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz (1998). The transcriptional program of sporulation in budding yeast. Science 282(5389), 699–705.
Dasgupta, A. and A. E. Raftery (1998). Detecting features in spatial point processes with clutter via model- based clustering. Journal of the American Statistical Association 93, 294–302.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39(1), 1–38.
Gallaugher, M. P. and P. D. McNicholas (2017). A mixture of matrix variate skew-t distributions. arXiv preprint arXiv:1703.08882.
Ghahramani, Z. and G. E. Hinton (1997). The EM algorithm for factor analyzers. Technical Report CRG- TR-96-1, University of Toronto, Toronto, Canada.
Han, S., H. Zhang, W. Karmaus, G. Roberts, and H. Arshad (2017). Adjusting background noise in cluster analyses of longitudinal data. Computational Statistics & Data Analysis 109, 93–104.
Hubert, L. and P. Arabie (1985). Comparing partitions. Journal of Classification 2(1), 193–218.
L¨utkepohl, H. (1996). Handbook of Matrices. Chicester: John Wiley & Sons.
McLachlan, G. J. and D. Peel (2000a). Finite Mixture Models. New York: John Wiley & Sons.
McLachlan, G. J. and D. Peel (2000b). Mixtures of factor analyzers. In Proceedings of the Seventh International Conference on Machine Learning, San Francisco, pp. 599–606. Morgan Kaufmann.
McNicholas, P. D. (2016). Mixture Model-Based Classification. Boca Raton: Chapman & Hall/CRC Press.
McNicholas, P. D. and T. B. Murphy (2010). Model-based clustering of longitudinal data. The Canadian Journal of Statistics 38(1), 153–168.
McNicholas, P. D., T. B. Murphy, A. F. McDaid, and D. Frost (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis 54(3), 711–723.
Melnykov, V. (2016). Model-based biclustering of clickstream data. Computational Statistics and Data Analysis 93, 31–45.
Mitchell, A. P. (1994). Control of meiotic gene expression in saccharomyces cerevisiae. Microbiological reviews 58(1), 56–70.
Pourahmadi, M. (1999). Joint mean-covariance models with applications to longitudinal data: unconstrained parameterisation. Biometrika 86(3), 677–690.
Pourahmadi, M. (2000). Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika 87(2), 425–435.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
Wakefield, J., C. Zhou, and S. Self (2003). Modelling gene expression over time: curve clustering with informative prior distributions. Bayesian statistics 7, 721–732.
Yoshida, R., T. Higuchi, and S. Imoto (2004). A mixed factors model for dimension reduction and extraction of a group structure in gene expression data. In Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference, pp. 161–172.
Yoshida, R., T. Higuchi, S. Imoto, and S. Miyano (2006). ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles. Bioinformatics 22, 1538–1539.