Multiview Representation Learning for a Union of Subspaces

2019·Arxiv

Abstract

Canonical correlation analysis (CCA) is a popular technique for learning representations that are maximally correlated across multiple views in data. In this paper, we extend the CCA based framework for learning a multiview mixture model. We show that the proposed model and a set of simple heuristics yield improvements over standard CCA, as measured in terms of performance on downstream tasks. Our experimental results show that our correlation-based objective meaningfully generalizes the CCA objective to a mixture of CCA models.

1. Introduction

Multiview, correlation-based representation learning has been shown to be useful on a variety of tasks (Hardoon et al., 2004; Wang et al., 2015b; Arora & Livescu, 2013; 2014; Benton et al., 2016; Vásquez-Correa et al., 2017; Holzenberger et al., 2019). Its main workhorse, Canonical Correlation Analysis (CCA) (Hotelling, 1936), enjoys several non-linear extensions with theoretical guarantees (kernel CCA and deep CCA) (Wang et al., 2015a; Lai & Fyfe, 2000), and can be extended to more than 2 views (Horst, 1961; Rastogi et al., 2015; Benton et al., 2017).

A related approach that yields an arguably richer representation is to learn a mixture model — instead of learning a single, high-dimensional, complex subspace, we can seek to find a union of subspaces. Mixture models, especially Gaussian Mixture Models, have a wide range of applications, from topic modeling (Blei et al., 2003) to speech recognition (Gales et al., 2008). In many cases, data can naturally be described as a union of distributions. For instance, phonemes in speech are a superposition of low-dimensional processes, either explained as templates of time-frequency vectors, or as typical articulatory motions (Sugamura et al., 1983). In the domain of information extraction, documents fall into categories such as newswire, blog posts, agency reports or tweets. In machine translation, domain match or mismatch plays an important part in the performance of a translation system (Koehn & Knowles, 2017). Identifying the underlying components of the mixture, either explicitly or implicitly, can help unsupervised modeling of the data, useful for a host of downstream tasks.

Learning a mixture of subspaces rather than a single subspace also makes personalization simpler. For an unseen example, being able to assign it quickly to a subpopulation can make classification tasks require less data to achieve good performance. For example, in the context of speaker adaptation, being able to assign a speaker to a specific group of speakers improves speech recognition performance (Kuhn et al., 2000).

In this paper, we propose to extend the framework of CCA to learn a union of subspaces. Specifically, we assume that each data point belongs to one of a finite number of sources, and we learn a pair of linear CCA transformations for each source. CCA with a single transformation can detect canonical directions that span the entire dataset. A mixture of CCA is also able to detect the main canonical directions, because they can be found in each subpopulation. In addition, a mixture of CCA can pick up correlations which are present at the subpopulation level but cancel each other out at the population level. This argument would suggest we can provide a much finer grained representation, possibly without increasing the dimensionality of the representation.

While the premise might seem straightforward, combining CCA and mixture models poses a number of challenges: what objective is to be maximized? How to simultaneously learn the cluster assignments and the transformations? Are there any guarantees regarding convergence, and recovery of cluster memberships? Once the parameters of the mixture model are learned, how to assign a new point (x, y) to a cluster? How to assign a single x without corresponding y to a cluster? This paper is meant primarily as a proof of concept, and leaves most of the above questions open. We propose a new objective function, as well as a heuristic way of maximizing it, and test its performance in two distinct settings.

2. Related work

We use Canonical Correlation Analysis (CCA) (Hotelling, 1936) to learn representations for a primary view. We assume that at training time, we are given two views of the same data point, for instance the audio recording and the articulatory measurements of a given speech utterance. These two views are represented by zero-mean random variables $x$ and $y$ ($d_x$- and $d_y$-dimensional respectively). Linear CCA seeks two linear transformations $U \in \mathbb{R}^{d_x \times k}$ and $V \in \mathbb{R}^{d_y \times k}$ such that the components of $U^\top x$ and $V^\top y$ are maximally correlated. Formally, we want to maximize $\mathbb{E}\left[(U^\top x)^\top (V^\top y)\right]$ subject to the constraints that $\mathbb{E}\left[U^\top x x^\top U\right] = \mathbb{E}\left[V^\top y y^\top V\right] = I_k$.

Given a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, we define $C_{xy}$ the empirical cross-covariance matrix between X and Y, and $C_{xx}$ and $C_{yy}$ the empirical auto-covariance matrices of X and Y, respectively. U and V are given by the k left and right singular vectors of $C_{xx}^{-1/2} C_{xy} C_{yy}^{-1/2}$ with the largest singular values, multiplied by $C_{xx}^{-1/2}$ and $C_{yy}^{-1/2}$, respectively.
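The closed-form solution described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names `inv_sqrt` and `linear_cca`, the row-per-sample layout, and the small ridge terms `reg_x` and `reg_y` (added for numerical stability) are assumptions.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def linear_cca(X, Y, k, reg_x=1e-3, reg_y=1e-3):
    """Linear CCA via SVD of the whitened cross-covariance.

    X: (N, d_x), Y: (N, d_y), one sample per row (assumed layout).
    Returns U (d_x, k), V (d_y, k) and the top-k singular values.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Cxx = Xc.T @ Xc / n + reg_x * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg_y * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular vectors of Cxx^{-1/2} Cxy Cyy^{-1/2}, sorted by singular value.
    A, s, Bt = np.linalg.svd(Wx @ Cxy @ Wy, full_matrices=False)
    U = Wx @ A[:, :k]
    V = Wy @ Bt[:k].T
    return U, V, s[:k]
```

With this construction, the projected primary view has identity (regularized) covariance by design, which is the whitening constraint of the CCA objective.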

At test time, we assume that only the primary view is available. Ideally, we would want to reconstruct the second view with the primary view (Ngiam et al., 2011). However, in general, that is a difficult task — take for example generating speech from text. Instead, it is easier to predict the dependent variate which has the largest multiple correlation. In other words, when no single regression provides a fully adequate solution, CCA is a better objective than predicting one view with the other.

Deep CCA (DCCA) (Andrew et al., 2013) is a natural extension of linear CCA, where one seeks to maximally correlate $U^\top f(x)$ and $V^\top g(y)$, where $f$ and $g$ are non-linear feature extractors, which can be learned via gradient descent on the CCA objective. It is also natural to extend CCA to multiple views (Horst, 1961).

CCA can also be reformulated as a probabilistic model. Bach & Jordan (2005) show that the solution to the linear CCA objective function is, up to arbitrary rotation and scaling, contained in the maximum-likelihood solution for the parameters of a Gaussian. Podosinnikova et al. (2016) extend this probabilistic formulation of CCA to fit any type of statistical distribution, including when one or both views are discrete.

Klami & Kaski (2007) place the probabilistic model of Bach & Jordan (2005) in a Bayesian setting, with a Dirichlet process, which naturally leads to a mixture of Gaussians. They then maximize log-likelihood under this GMM model using methods from variational inference. As a final step, they extract CCA projections from each Gaussian. This amounts to performing GMM-based clustering first, and then performing CCA on each of the learned clusters. In contrast, we define a correlation-based objective for multiple transformations, and propose to maximize it directly. It remains to be examined whether both objective functions are equivalent, or even whether stationary points of both objective functions are equivalent. In fact, the model of Klami & Kaski (2007) might be suboptimal in terms of our objective. Other works (Wang, 2007; Viinikanoja et al., 2010; Hosino, 2010) have used very similar approaches.

Most related to this work is that of Fern et al. (2005), where the authors use a mixture of CCA, and provide an optimization heuristic similar to ours. However, their work differs on a number of points. First, Fern et al. (2005) do not provide an objective function to optimize. Second, they do not test the CCA projections in a downstream task, and are thus arguably not concerned with questions pertaining to representation learning. Third, Fern et al. (2005) do not provide any way of assigning a point with a single view to a cluster (i.e., they specify how to assign (x, y) to a cluster, but not how to assign x to a cluster in the absence of y). The main focus of our work is the objective function, as a possible extension of CCA and representation learning method.

3. Mixture model for Canonical Correlation Analysis

We propose to learn a union of subspaces that best characterizes a set of points with two views, using canonical correlation analysis (CCA). It involves assigning cluster memberships to data points, and we limit ourselves to linear versions of CCA. Throughout this paper, we will refer to this method as mixture of CCA (MCCA).

3.1. Objective formulation

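We assume that our data consists of two sets of points. Let $X \in \mathbb{R}^{d_x \times N}$ and $Y \in \mathbb{R}^{d_y \times N}$ be the matrices containing the paired views, with $x_i$ (resp. $y_i$) being the i-th column of X (resp. Y). For example, in our application to speech recognition, X represents a set of MFCC frames and Y the corresponding articulatory measurements. We consider a mixture of R subspaces and define the MCCA objective as:

$$\max \; \sum_{r=1}^{R} \mathrm{tr}\left(U_r^\top C_{xy}^{(r)} V_r\right)$$

subject to the constraints that

$$U_r^\top C_{xx}^{(r)} U_r = V_r^\top C_{yy}^{(r)} V_r = I_k \quad \text{for all } r \in \{1, \ldots, R\},$$

where $U_r \in \mathbb{R}^{d_x \times k}$ and $V_r \in \mathbb{R}^{d_y \times k}$, $r \in \{1, \ldots, R\}$, are the CCA transformation matrices associated with each mixture component.

The assignment of data points to mixture components is given by the following scalar weights:

$$\alpha_{ir} \geq 0, \quad \sum_{r=1}^{R} \alpha_{ir} = 1, \quad i \in [N],\; r \in [R],$$

where [p] denotes the set {1, . . . , p}.

The weighted empirical covariance and cross-covariance matrices for each mixture component are given as

$$C_{xx}^{(r)} = \frac{\sum_{i} \alpha_{ir} \, (x_i - \mu_x^{(r)})(x_i - \mu_x^{(r)})^\top}{\sum_{i} \alpha_{ir}}, \qquad C_{yy}^{(r)} = \frac{\sum_{i} \alpha_{ir} \, (y_i - \mu_y^{(r)})(y_i - \mu_y^{(r)})^\top}{\sum_{i} \alpha_{ir}}$$

and

$$C_{xy}^{(r)} = \frac{\sum_{i} \alpha_{ir} \, (x_i - \mu_x^{(r)})(y_i - \mu_y^{(r)})^\top}{\sum_{i} \alpha_{ir}},$$

and the weighted means of the mixture components are given as

$$\mu_x^{(r)} = \frac{\sum_{i} \alpha_{ir} \, x_i}{\sum_{i} \alpha_{ir}}, \qquad \mu_y^{(r)} = \frac{\sum_{i} \alpha_{ir} \, y_i}{\sum_{i} \alpha_{ir}}.$$

At training time, the goal is to find

$$\{U_r, V_r\}_{r \in [R]} \quad \text{and} \quad \{\alpha_{ir}\}_{i \in [N],\, r \in [R]},$$

i.e., to find, for each mixture component, both CCA transformation matrices, and the assignments for each data point.

At test time, given x and/or y, one needs to estimate the assignments $\alpha_r$, $r \in [R]$, before projecting x (respectively, y) to $U_r^\top x$ (respectively, $V_r^\top y$).

To avoid spurious correlations, we regularize the covariance matrices by adding scaled identity matrices, i.e., replacing $C_{xx}^{(r)}$ (respectively, $C_{yy}^{(r)}$) with $C_{xx}^{(r)} + \lambda_x I$ (respectively, $C_{yy}^{(r)} + \lambda_y I$).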
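As a rough illustration of the weighted statistics and the regularization step for a single mixture component, consider the following NumPy sketch. It is not the authors' code: the function name `weighted_moments`, the parameters `reg_x` and `reg_y`, and the choice to normalize by the sum of the weights are assumptions.

```python
import numpy as np

def weighted_moments(X, Y, alpha_r, reg_x=0.1, reg_y=0.1):
    """Weighted means and (regularized) covariances for one component.

    X: (d_x, N), Y: (d_y, N), one data point per column.
    alpha_r: (N,) nonnegative assignment weights for component r.
    Normalizing by the sum of weights is an assumption of this sketch.
    """
    w = alpha_r / alpha_r.sum()
    mu_x = X @ w
    mu_y = Y @ w
    Xc = X - mu_x[:, None]
    Yc = Y - mu_y[:, None]
    # Scaled identity matrices are added to the auto-covariances only.
    Cxx = (Xc * w) @ Xc.T + reg_x * np.eye(X.shape[0])
    Cyy = (Yc * w) @ Yc.T + reg_y * np.eye(Y.shape[0])
    Cxy = (Xc * w) @ Yc.T
    return mu_x, mu_y, Cxx, Cyy, Cxy
```

With hard assignments (weights in {0, 1}), these reduce to the ordinary empirical statistics of the points assigned to the component.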

3.2. Learning and inference

The optimization problem in Section 3.1 is nonconvex, and jointly maximizing over all parameters appears computationally intractable. Therefore, we consider the following optimization approach that first estimates the assignments $\alpha_{ir}$, then estimates the transformations $U_r$ and $V_r$, as summarized in Algorithm 1. When the $\alpha_{ir}$ are fixed, it is well known that we can globally maximize the R CCA objectives over the choice of $U_r$ and $V_r$, as described in Section 2. Similarly, fixing the CCA subspaces, maximizing over the $\alpha_{ir}$ is essentially learning a mixture model over the shared representation.

At training time, we first cluster the data points using a CCA projection and k-means clustering, yielding hard assignments for the $\alpha_{ir}$, i.e. $\alpha_{ir} \in \{0, 1\}$. We then learn $U_r$ and $V_r$ for each cluster.

At test time, we need a way of inferring the $\alpha_r$ for a given $(x, y)$ where $x$ or $y$ may be missing. In our experimental setting, only x is available at test time. From the properties of the CCA projections, we know that $U_r^\top C_{xx}^{(r)} U_r = I_k$, i.e. the projections of the points in cluster r have identity covariance. Heuristically, we expect $U_r^\top (x - \mu_x^{(r)})$, where $\mu_x^{(r)}$ is the mean of cluster r in the primary view, to be drawn from a unit variance Gaussian distribution. Thus, we assign x to the cluster

$$\hat{r}(x) = \arg\max_{r \in [R]} \; \pi_r \exp\left(-\tfrac{1}{2} \left\| U_r^\top (x - \mu_x^{(r)}) \right\|^2\right),$$

where $\pi_r$ is the fraction of points belonging to cluster r. In some of our speech recognition experiments, the $\alpha_{ir}$ will be given at training time.
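The two-stage heuristic (cluster first, then fit one CCA per cluster, then assign new points with a Gaussian-style score) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: the helper names (`cca`, `kmeans`, `fit_mcca`, `assign`), the plain Lloyd's k-means, the single shared regularizer `reg`, and the exact form of the assignment score are all assumptions.

```python
import numpy as np

def _inv_sqrt(C):
    w, Q = np.linalg.eigh(C)
    return (Q / np.sqrt(w)) @ Q.T

def cca(X, Y, k, reg=0.1):
    """Regularized linear CCA; rows of X and Y are samples."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = len(X)
    Wx = _inv_sqrt(Xc.T @ Xc / n + reg * np.eye(X.shape[1]))
    Wy = _inv_sqrt(Yc.T @ Yc / n + reg * np.eye(Y.shape[1]))
    A, _, Bt = np.linalg.svd(Wx @ (Xc.T @ Yc / n) @ Wy, full_matrices=False)
    return Wx @ A[:, :k], Wy @ Bt[:k].T, mu_x

def kmeans(Z, R, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns hard labels in {0, ..., R-1}."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=R, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for r in range(R):
            if np.any(labels == r):
                centers[r] = Z[labels == r].mean(axis=0)
    return labels

def fit_mcca(X, Y, R, k, reg=0.1):
    """Cluster in a global CCA space, then fit one CCA per cluster."""
    U0, _, mu0 = cca(X, Y, k, reg)
    labels = kmeans((X - mu0) @ U0, R)
    comps = []
    for r in range(R):
        idx = labels == r
        if not idx.any():  # skip empty clusters
            continue
        U, V, mu = cca(X[idx], Y[idx], k, reg)
        comps.append({"U": U, "V": V, "mu_x": mu, "pi": idx.mean()})
    return comps, labels

def assign(x, comps):
    """argmax_r of log(pi_r) - 0.5 * ||U_r^T (x - mu_r)||^2."""
    scores = [np.log(c["pi"]) - 0.5 * np.sum(((x - c["mu_x"]) @ c["U"]) ** 2)
              for c in comps]
    return int(np.argmax(scores))
```

Working in log space avoids underflow when the projected point is far from a cluster mean; the ranking of clusters is the same as with the product of mixing weight and Gaussian density.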

4. Experiments

In this section, we illustrate the possible uses of MCCA, and test its performance against standard, vanilla linear CCA described in Section 2, and referred to as VCCA. All non-linear extensions of CCA (deep CCA and kernel CCA) have significantly more representational power than linear CCA. Most likely, deep CCA would outperform all other methods. Thus, for the comparison to non-linear extensions of CCA to be fair, one would have to extend MCCA to non-linear methods. This would pose many more challenges related to optimization. Instead, our intention is to carefully illustrate the linear version, leaving non-linear extensions to future work.

Table 1. 4 groups used to partition the English phonemes.

4.1. Phoneme classification

The University of Wisconsin X-ray microbeam database (XRMB) (Westbury et al., 1990) is a set of sound recordings and articulatory measurements, acquired during production of English read speech. The acoustic recordings are processed into sequences of 13-dimensional MFCC frames. The articulatory measurements are the horizontal and vertical displacements of eight pellets affixed to critical articulators (tongue, lips, jaw) at a given point in time, resulting in a 16-dimensional observation vector. Both views are temporally aligned, and thus we have pairs of MFCC frames and articulatory features. This correspondence is the only supervisory signal when learning representations; while the manual annotations of the spoken text are available, we do not use them for feature learning, to avoid any task dependence in the learned representations. Pellet locations are missing from parts of the original dataset for various reasons; we use the completed version of the XRMB dataset due to Wang et al. (2014). The MFCC frames are augmented with deltas and double deltas, and are mean-centered and variance-normalized per speaker. The articulatory features are also mean-centered and variance-normalized per speaker.

In this section, we illustrate the usefulness of learning a union of subspaces, by partitioning the 39 phonemes of the XRMB dataset into 4 groups, as detailed in Table 1. For some experiments, we assume access at training time to an oracle providing us with the correct group for each training instance; in those cases, we are not concerned with inferring the cluster assignments. At test time, we also experiment with the presence and absence of said oracle. In the absence of the oracle, we use the heuristic described in Section 3.2 to infer the membership of each point.

Using the notation of Section 3.1, X consists of 7 stacked MFCC frames with deltas and double deltas (39 dimensions per frame), centered around the frame of interest, and is thus 273-dimensional (7 × 39). Y consists of 7 stacked articulatory feature vectors, corresponding to the 7 MFCC frames, and is thus 112-dimensional (7 × 16). The training set has 1.4M data points, and each of the 6 cross-validation folds mentioned below has between 79k and 86k data points.
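The frame stacking that produces the 273- and 112-dimensional inputs can be illustrated as below. This is a hedged sketch: the function name `stack_frames` and the edge-padding strategy at utterance boundaries are assumptions, since the text does not say how boundary frames are handled.

```python
import numpy as np

def stack_frames(frames, context=3):
    """Stack each frame with `context` neighbors on either side.

    frames: (T, d) array, one feature vector per time step.
    Returns a (T, (2 * context + 1) * d) array. Boundary frames are
    repeated to pad the edges (an assumption of this sketch).
    """
    T, _ = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])
```

For example, 39-dimensional MFCC frames with `context=3` yield 7 × 39 = 273 dimensions, and 16-dimensional articulatory vectors yield 7 × 16 = 112 dimensions.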

To evaluate the quality of the learned representations, we follow Wang et al. (2015b) and leave out 12 speakers. The other 35 speakers are used to learn representations. The 12 left-out speakers are used to measure how well the learned representations for audio can be used to perform phoneme classification on unseen data. To estimate generalization error, the 12 speakers are partitioned into 6 sets of 2 speakers each, and used in a 6-fold cross-validation, each fold composed of 4 training speakers, 2 dev speakers, and 2 test speakers. We use the 4 train and 2 dev speakers to find the best parameters for a k-nearest neighbor classifier (Malkov & Yashunin, 2018), then measure the score on the dev set. We use the average dev score over the 6 folds to compare hyperparameters of a given method. This procedure allows us to pick the best hyperparameters for a given CCA method (VCCA or MCCA). Methods are then compared on the 6 test sets, using the hyperparameters selected in this way. We report the average and standard deviation of the performance over the 6 dev and test sets. The k-nearest neighbor classifier hyperparameters are the type of distance used (L2 or cosine), the number of neighbors (8, 16, 32, 64, 128 or 256), and whether to append the original MFCC features to the CCA features.

We sweep the regularization parameters independently over {1, 0.1, 0.001}. For VCCA, we use the projection of the MFCC features, and experiment with appending the original MFCC features to perform k-nearest neighbor classification. For MCCA, we also experiment with appending the original MFCC features. In addition, when projecting a point x, we have the possibility of mapping x to $U_{\hat{r}}^\top x$, where $\hat{r}$ is the cluster inferred for x, or to the concatenation of $U_1^\top x, \ldots, U_R^\top x$, which is an Rk-dimensional vector. Note that the former requires inferring the cluster assignment at test time, while the latter doesn’t. We experiment with both settings, reporting the former as “projection” and the latter as “concatenation”.
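The difference between the two test-time mappings can be made concrete with a short sketch. The function names `project` and `concatenate` are hypothetical, and any per-cluster mean-centering is left out for brevity.

```python
import numpy as np

def project(x, Us, r_hat):
    """'Projection': use only the transformation of the inferred cluster."""
    return Us[r_hat].T @ x

def concatenate(x, Us):
    """'Concatenation': stack the projections of all R components."""
    return np.concatenate([U.T @ x for U in Us])
```

With R components of k dimensions each, `project` returns a k-dimensional vector and needs a cluster index, while `concatenate` returns an Rk-dimensional vector and needs none.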

Table 2 reports our experimental results. We experiment with the oracle of Table 1 being present or absent at training time (denoted by “oracle” and “no oracle” respectively). When the oracle is absent at training time, we sweep R over {2, 4, 8, 16}. In this setting, both instances of MCCA yield a significant improvement over VCCA. The best dev score for VCCA is achieved with , while the best score for MCCA with oracle is achieved with k = 50 and the same values of and ; the best score without oracle is achieved with and R = 8. There is no reduction in the dimensionality of the representation because the best dev score is achieved by concatenating all R representations.

Table 2. Phoneme classification accuracy on XRMB dataset without oracle at test time. Oracle (resp. no oracle) indicates access (resp. no access) to the oracle at training time.

The results in Table 2 show that MCCA yields a significant improvement over VCCA, with and without oracle at training time. In practice, it is likely that there is no oracle at training time, either because of a lack of labels or because there is no clear partitioning of the data. The bottom two lines in Table 2 show that the absence of the oracle does not necessarily imply a loss in performance. In fact, learning each point’s assignment to a mixture component yields better results than following the heuristic from Table 1.

If the oracle had been available at test time in addition to training time, the instance of MCCA projection reported in Table 2 would reach on the dev set and accuracy on the test set. This shows that while our test-time heuristic is able to outperform standard CCA, there is still room for improvement. It also shows that, although our MCCA projection results are below our MCCA concatenation results, given correct point assignments, projection is enough to guarantee a useful representation.

In Figure 1, we show perplexity matrices for some of our MCCA models on the test set. On that plot, phonemes are sorted according to their membership in the 4 groups of Table 1. The first column has access to the oracle at training time, and achieves the best score when projecting points with a single CCA transformation. In that case, ideally, we would want the assignments to be 4 disjoint bands following the 4 clusters. However, each phoneme is mostly assigned to groups 1 and 2, which are also the groups with the largest mixing weights. This mis-assignment can be somewhat alleviated by removing the mixing weight term (see Section 3.2), but doing so worsens the dev score. Column 2 shows that, given 4 clusters but no access to the oracle, the clustering heuristic used comes up with sharp clusters for consonants. Roughly, cluster 1 contains the labials, cluster 3 contains the alveolars, and the velars are spread out among clusters 1 and 3. Consonants are mostly spread out between clusters. If we do not constrain the number of clusters to be 4, the best performing model without oracle at training time and using projection has 2 clusters, shown in the third column. In that case, we see mostly sharp assignments, i.e. almost all phonemes are assigned to a single cluster, despite no access to any phonemic information. With some exceptions, consonants belong to cluster 1, and vowels fall into either cluster. The fourth column shows cluster assignments for the best performing MCCA model; note that this model concatenates representations, and thus cluster assignments do not matter at test time. Clusters 2 and 7 seem to be unused, and cluster 4 loosely corresponds to labials. Alveolars and velars are spread out among clusters 1 and 8.

Figure 1. Perplexity matrices for various MCCA models on the test set. Rows are labeled with phonemes and columns with mixture components. Rows are normalized to sum to 1. Each model was chosen as the best in its category, as described by the label. Columns 1, 3 and 4 correspond to the models reported in Table 2.

4.2. Twitter data

To provide an illustration of MCCA in a different domain, we use the corpus of Twitter data of Benton et al. (2016). In that paper, various methods are used to build representations of Twitter users, based on their tweets and friend networks. Following the nomenclature of Benton et al. (2016), we use the ego view, i.e. a PCA-based representation of the user’s tweets, and the friends view, i.e. a PCA-based representation of the user’s friend network or graph. We use the ego view as the primary view (available at training and test time) and the friends view as the secondary view (available at training time only). Using the notation of Section 3.1, X (resp. Y) is the primary (resp. secondary) view. Both are 1000-dimensional. To estimate how much the use of CCA is improving the performance on a given task, we also report results using the primary view at training time, under “raw features”.

4.2.1. USER ENGAGEMENT PREDICTION

To evaluate the learned representations, we perform a user engagement prediction task, using hashtag as a proxy (following Benton et al. (2016)). We have 2 sets (dev and test) of 200 unseen hashtags each, and for each hashtag, a set of 16k unseen users who have used them. The task is to predict, for each hashtag, which users are likely to use it, based on the first 10 users who have used it. We follow the setup of Benton et al. (2016), but report slightly different metrics. Note that because of missing views, we have had to restrict the number of users, including in the test set, and thus the results are not comparable. The train, dev and test set contain respectively 79900, 8220 and 8071 users. For a given method, we use the performance on the dev set to select the best hyperparameters, then compare methods on the test set. We decide whether or not to concatenate the original representation to the CCA projection based on the dev set performance. When concatenating, we scale each representation by its average norm taken over the dataset, in order to balance out the weight of each representation in the cosine similarity.

More specifically: we first project all the users of the dev (resp. test) set using the learned CCA transformations. For each hashtag i in the dev (resp. test) set, we have a list of users who have used it. We pick the first 10, and compute the average of their embeddings; we take this as the representation of the hashtag. Hashtag representations and user representations are separately mean-centered. Then, for each hashtag representation $h_i$ and each user representation $u_j$, we compute the scaled cosine similarity $s_{ij} = \frac{1}{2}\left(1 + \frac{h_i^\top u_j}{\|h_i\| \, \|u_j\|}\right)$, yielding a confidence score between 0 and 1. We use that score as a measure of how likely the user j would use the hashtag i. We can compare these scores with the ground truth of which user actually used each hashtag.
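The scoring procedure can be sketched in NumPy as follows. This is an illustration only: the function name `hashtag_scores`, the argument layout, and in particular the affine rescaling of the cosine to [0, 1] via (1 + cos) / 2 are assumptions about how the "scaled cosine similarity" is computed.

```python
import numpy as np

def hashtag_scores(user_emb, hashtag_users, n_seed=10):
    """Confidence score for every (hashtag, user) pair.

    user_emb: (n_users, k) projected user representations.
    hashtag_users: one list of user indices per hashtag, ordered by
        time of first use (assumed layout).
    Returns an (n_hashtags, n_users) array of scores in [0, 1].
    """
    # Hashtag representation: mean embedding of its first n_seed users.
    H = np.stack([user_emb[u[:n_seed]].mean(axis=0) for u in hashtag_users])
    # Hashtags and users are mean-centered separately.
    H = H - H.mean(axis=0)
    E = user_emb - user_emb.mean(axis=0)
    H = H / np.maximum(np.linalg.norm(H, axis=1, keepdims=True), 1e-12)
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    # Map cosine similarity from [-1, 1] to [0, 1] (assumed scaling).
    return 0.5 * (1.0 + H @ E.T)
```

Each row of the result can then be sorted in decreasing order to rank users for the corresponding hashtag.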

We report Recall@1000, mean reciprocal rank, and the area under the ROC curve (ROC-AUC) (Fawcett, 2004) of a multiclass classifier mapping users to hashtags. Recall@1000 and ROC-AUC are bounded between 0 and 100, and mean reciprocal rank between 0 and 1. For all of these metrics, higher is better. For each hashtag, we rank the users based on their confidence score, in decreasing order. Recall@1000 tells us how much the top 1000 users overlap with the ground truth, and so how many relevant answers are present in our top 1000 answers. Mean reciprocal rank tells us how close the first relevant result is to the top of our result list. If we were to build the simplest classifier for hashtag j, it would rule that user i is going to use hashtag j if the corresponding confidence score exceeds a threshold $\tau_j$, where $\tau_j$ has to be set for each hashtag. The ROC-AUC tells us how well the confidence scores of correct users are separated from those of incorrect users. A ROC-AUC score of 100 means we could set the $\tau_j$ randomly and still classify each user and hashtag correctly, and 50 means we are doing as well as a random classifier. The higher the ROC-AUC score, the better the performance of the classifier will be, regardless of the values of the thresholds $\tau_j$.
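For a single hashtag, the three metrics can be computed from the confidence scores and a boolean ground-truth vector as in the sketch below. The function names are hypothetical, the values are returned on a 0 to 1 scale (multiply Recall@K and ROC-AUC by 100 to match the scale used here), and the pairwise ROC-AUC computation is chosen for readability rather than efficiency.

```python
import numpy as np

def recall_at_k(scores, relevant, k=1000):
    """Fraction of relevant users found among the k highest scores."""
    top = np.argsort(-scores)[:k]
    return relevant[top].sum() / relevant.sum()

def reciprocal_rank(scores, relevant):
    """1 / rank of the first relevant user in the sorted list."""
    order = np.argsort(-scores)
    first = np.flatnonzero(relevant[order])[0]
    return 1.0 / (first + 1)

def roc_auc(scores, relevant):
    """Probability that a relevant user outscores an irrelevant one
    (ties count one half). Uses O(n_pos * n_neg) memory."""
    pos, neg = scores[relevant], scores[~relevant]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties
```

Averaging `reciprocal_rank` over hashtags gives the mean reciprocal rank; the other two metrics are averaged over hashtags in the same way.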

In contrast to the experiments in Section 4.1, there is no obvious partitioning of Twitter users on this dataset, and thus we have no oracle at all in this case. We sweep k over {200, 400, 600, 800, 1000}, and the two regularization parameters independently over {1, 0.1, 0.01}. In the case of MCCA, we sweep the number of clusters R over {2, 4, 8, 16} and follow Section 3.2 for learning and inference.

We summarize the results of the user engagement prediction task in Table 3. On all three metrics considered, MCCA outperforms VCCA. For Recall@1000 and ROC-AUC, the hyperparameters achieving the best scores are identical. For VCCA, ; for MCCA projection, ; for MCCA concatenation, . For all three, the best scores are achieved with the CCA representation alone (without appending the raw features). In the case of mean reciprocal rank, the best scores are achieved by concatenating the CCA representation and the raw features. For VCCA, ; for MCCA projection, ; for MCCA concatenation, . The large difference in best performing hyperparameters shows how important task-based hyperparameter selection is. In that respect, MCCA has 2 more hyperparameters that can be tuned based on the task: the number of mixture components R, and whether to project x to a single component or to the concatenation of all R projections.

Except for mean reciprocal rank, MCCA projection performs consistently worse than VCCA. This could possibly be because the clusters picked by our heuristic have no intrinsic relationship with the hashtags. It is interesting to note that MCCA projection drastically improves the mean reciprocal rank to a perfect or almost perfect score, meaning that the top result is a correct answer, almost always. Higher ROC-AUC scores for MCCA concatenation show that this method is better able to separate, for each hashtag, relevant from irrelevant users. This is consistent with higher Recall@1000 scores, which indicate more relevant users within the top 1000 results.

Table 3. User engagement prediction results on Twitter dataset. REC: Recall@1000, ROC-AUC: area under the ROC curve, MRR: mean reciprocal rank.

4.2.2. FRIEND RECOMMENDATION

In a similar setting, we perform friend recommendation (Benton et al., 2016). 500 user accounts were set aside (250 each for dev and test), and it was recorded which of the dev and test users followed those accounts. The friend recommendation task amounts to predicting which user accounts a given user is likely to follow. The train, dev and test set contain respectively 6522, 82608 and 81985 users. The task setup is identical to user engagement prediction. We sweep over the same values of hyperparameters as for user engagement prediction.

We report our results in Table 4. Based on the results using the raw features, this task is much harder than user engagement prediction, described in Section 4.2.1. In this setting, MCCA performs better than or on par with VCCA. The smaller performance gap between VCCA and MCCA could be explained by the much smaller size of the train set, and possibly the difficulty of obtaining a coherent clustering. For each of the best performing MCCA instances, the clustering at training time shows that one of the clusters collapses to 1, 2 or 3 points. For each method, the best performing hyperparameters vary from one task to the other, without any pattern emerging. Again, this shows the impact of task-based hyperparameter selection.

While the task is quite different from user engagement prediction described in Section 4.2.1, the conclusions regarding VCCA and MCCA remain mostly identical.

5. Conclusion

In this paper, we proposed a novel objective for representation learning, combining CCA and mixture models, in conjunction with simple heuristics to maximize the objective at training time, and use the representations at test time. Evaluating our representations in different settings, we have shown the usefulness of both our objective and our heuristics. Overall, across tasks, our proposed method performs on par with or better than the standard version of CCA.

Table 4. Friend recommendation results on Twitter dataset. REC: Recall@1000, ROC-AUC: area under the ROC curve, MRR: mean reciprocal rank.

Our results suggest that the proposed method is a valid objective, potentially a good generalization of the standard CCA objective. With more hyperparameters than the standard CCA objective, and the possibility of informing the mixture components with hierarchical structure present in the data, it has the potential to better adapt to downstream tasks. It would further benefit from a more thorough optimization scheme with provable guarantees, and possibly extensions to non-linear methods.

References

Andrew, Galen, Arora, Raman, Bilmes, Jeff, and Livescu, Karen. Deep canonical correlation analysis. In Int. Conf. on Machine Learning, pp. 1247–1255, 2013.

Arora, Raman and Livescu, Karen. Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains. In 2013 IEEE Int. Conf. on Acoustics, Speech and Sig. Proc., pp. 7135–7139. IEEE, 2013.

Arora, Raman and Livescu, Karen. Multi-view learning with supervision for transformed bottleneck features. In 2014 IEEE Int. Conf. on Acoustics, Speech, and Sig. Proc. (ICASSP), pp. 2499–2503. IEEE, 2014.

Bach, Francis R and Jordan, Michael I. A probabilistic interpretation of canonical correlation analysis. 2005.

Benton, Adrian, Arora, Raman, and Dredze, Mark. Learning multiview embeddings of twitter users. In Proc. 54th Annual Meeting of the Assoc. Comp. Linguistics (Volume 2: Short Papers), volume 2, pp. 14–19, 2016.

Benton, Adrian, Khayrallah, Huda, Gujral, Biman, Reisinger, Dee Ann, Zhang, Sheng, and Arora, Raman. Deep generalized canonical correlation analysis. arXiv preprint arXiv:1702.02519, 2017.

Blei, David M, Ng, Andrew Y, and Jordan, Michael I. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.

Fawcett, Tom. ROC graphs: Notes and practical considerations for researchers. Machine learning, 31(1):1–38, 2004.

Fern, Xiaoli Z, Brodley, Carla E, and Friedl, Mark A. Correlation clustering for learning mixtures of canonical correlation models. In Proceedings of the 2005 SIAM Int. Conf. on Data Mining, pp. 439–448. SIAM, 2005.

Gales, Mark, Young, Steve, et al. The application of hidden markov models in speech recognition. Foundations and Trends in Sig. Proc., 1(3):195–304, 2008.

Hardoon, David R, Szedmak, Sandor, and Shawe-Taylor, John. Canonical correlation analysis: An overview with application to learning methods. Neural computation, 16 (12):2639–2664, 2004.

Holzenberger, Nils, Palaskar, Shruti, Madhyastha, Pranava, Metze, Florian, and Arora, Raman. Learning from multiview correlations in open-domain videos. In IEEE Int. Conf. Acoustics, Speech, and Sig. Proc. (ICASSP), 2019.

Horst, Paul. Generalized canonical correlations and their applications to experimental data. Journal of Clinical Psychology, 17(4):331–347, 1961.

Hosino, Tikara. High dimensional non-linear modeling with bayesian mixture of CCA. In Int. Conf. on Neural Information Processing, pp. 446–453. Springer, 2010.

Hotelling, Harold. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.

Klami, Arto and Kaski, Samuel. Local dependent components. In Proceedings of the 24th Int. Conf. on Machine learning, pp. 425–432. ACM, 2007.

Koehn, Philipp and Knowles, Rebecca. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017.

Kuhn, Roland, Junqua, J-C, Nguyen, Patrick, and Niedzielski, Nancy. Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech and Audio Proc., 8(6), 2000.

Lai, Pei Ling and Fyfe, Colin. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(05):365–377, 2000.

Malkov, Yury A and Yashunin, Dmitry A. Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Ngiam, Jiquan, Khosla, Aditya, Kim, Mingyu, Nam, Juhan, Lee, Honglak, and Ng, Andrew Y. Multimodal deep learning. In Proc. Int. Conf. Machine Learning, 2011.

Podosinnikova, Anastasia, Bach, Francis, and Lacoste-Julien, Simon. Beyond CCA: moment matching for multi-view models. arXiv:1602.09013, 2016.

Rastogi, Pushpendre, Van Durme, Benjamin, and Arora, Raman. Multiview LSA: Representation learning via generalized CCA. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 556–566, 2015.

Sugamura, Noboru, Shikano, Kiyohiro, and Furui, Sadaoki. Isolated word recognition using phoneme-like templates. In Acoustics, Speech, and Sig. Proc., IEEE Int. Conf. on ICASSP’83., volume 8, pp. 723–726. IEEE, 1983.

Vásquez-Correa, Juan Camilo, Orozco-Arroyave, Juan Rafael, Arora, Raman, Nöth, Elmar, Dehak, Najim, Christensen, Heidi, Rudzicz, Frank, Bocklet, Tobias, Cernak, Milos, Chinaei, Hamidreza, et al. Multi-view representation learning via GCCA for multimodal analysis of parkinson’s disease. In 2017 IEEE Int. Conf. on Acoustics, Speech, and Sig. Proc. (ICASSP), pp. 2966–2970. IEEE, 2017.

Viinikanoja, Jaakko, Klami, Arto, and Kaski, Samuel. Variational bayesian mixture of robust CCA models. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 370–385. Springer, 2010.

Wang, Chong. Variational bayesian approach to canonical correlation analysis. IEEE Transactions on Neural Networks, 18(3):905–910, 2007.

Wang, Weiran, Arora, Raman, and Livescu, Karen. Reconstruction of articulatory measurements with smoothed low-rank matrix completion. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pp. 54–59. IEEE, 2014.

Wang, Weiran, Arora, Raman, Livescu, Karen, and Bilmes, Jeff. On deep multi-view representation learning. In Int. Conf. on Machine Learning, pp. 1083–1092, 2015a.

Wang, Weiran, Arora, Raman, Livescu, Karen, and Bilmes, Jeff A. Unsupervised learning of acoustic features via deep canonical correlation analysis. In Acoustics, Speech, and Sig. Proc. (ICASSP), 2015 IEEE Int. Conf. on, pp. 4590–4594. IEEE, 2015b.

Westbury, John, Milenkovic, Paul, Weismer, Gary, and Kent, Raymond. X-ray microbeam speech production database. The Journal of the Acoustical Society of America, 88(S1): S56–S56, 1990.