Training Input-Output Recurrent Neural Networks through Spectral Methods

2016·Arxiv

Abstract

Abstract

We consider the problem of training input-output recurrent neural networks (RNN) for sequence labeling tasks. We propose a novel spectral approach for learning the network parameters. It is based on decomposition of the cross-moment tensor between the output and a non-linear transformation of the input, based on score functions. We guarantee consistent learning with polynomial sample and computational complexity under transparent conditions such as non-degeneracy of model parameters, polynomial activations for the neurons, and a Markovian evolution of the input sequence. We also extend our results to Bidirectional RNN which uses both previous and future information to output the label at each time point, and is employed in many NLP tasks such as POS tagging.

Keywords: Recurrent neural networks, sequence labeling, spectral methods, score function.

1 Introduction

Learning with sequential data is widely encountered in domains such as natural language processing, genomics, speech recognition, video processing, financial time series analysis, and so on. Recurrent neural networks (RNN) are a flexible class of sequential models which can memorize past information, and selectively pass it on across sequence steps on multiple scales. However, training RNNs is challenging in practice, and backpropagation suffers from exploding and vanishing gradients as the length of the training sequence grows. To overcome this, either RNNs are trained over short sequences or incorporate more complex architectures such as long short-term memories (LSTM). For a detailed overview of RNNs, see [20]. Figure 1 contrasts the RNN with a feedforward neural network which has no memory.

On the theoretical front, understanding of RNNs is at best rudimentary. With the current techniques, it is not tractable to analyze the highly non-linear state evolution in RNNs. Analysis of backpropagation is also intractable due to non-convexity of the loss function, and in general, reaching the global optimum is hard. Here, we take the first steps towards addressing these challenging issues. We design novel spectral methods for training IO-RNN and BRNN models.

We consider the class of input-output RNN or IO-RNN models, where each input in the sequence has an output label . These are useful for sequence labeling tasks, which has many applications such as parts of speech (POS) tagging and named-entity recognition (NER) in NLP [21], motif finding in protein analysis [9], action recognition in videos [16], and so on.

In addition, we also consider an extension of IO-RNN, viz., the bi-directional RNN or BRNN, first proposed by Schuster and Paliwal [22]. This includes two classes of hidden neurons: the first class receives recurrent connection from previous states, and the second class receives it from next steps. See Figure 1(c). BRNN is useful in NLP tasks such as POS tagging, where both previous and next words in a sentence have an effect on labeling the current word.

Figure 1: Graphical representation of a Neural Network (NN) versus an Input-Output Recurrent Neural Network (IO-RNN) and a Bidirectional Recurrent Neural Network (BRNN)

In this paper, we develop novel spectral methods for training IO-RNN and BRNN models. Spectral methods have previously been employed for unsupervised learning of a range of latent variable models such as hidden Markov models (HMM), topic models, network community models, and so on [4]. The idea is to decompose moment matrices and tensors using computationally efficient algorithms. The recovered components of the tensor decomposition yield consistent estimates of the model parameters. However, a direct application of these techniques is ruled out due to non-linearity of the activations in the RNN.

Recently, Janzamin et al. [15] derived a new framework for training input-output models in the supervised framework. It is based on spectral decomposition of moment tensors, obtained after certain non-linear transformations of the input. These non-linear transformations take the form of score functions, which depend on generative models of the input. This provides a new approach for transferring generative information, obtained through unsupervised learning, into discriminative training on labeled samples. Based on the aforementioned approach, Janzamin et al. [14] provided guaranteed risk bounds for training two-layer feedforward neural network models with polynomial sample and computational bounds. The conditions for obtaining the risk bounds are mild: a small approximation error for the target function under the given class of neural networks, a generative input model with a continuous distribution, and general sigmoidal activations at the neurons.

We propose new spectral approaches for training IO-RNNs in both classification and regression settings. The previous score function approach for training feedforward networks (as described above) does not immediately extend, and there are some non-trivial challenges: (i) Non-linearity in a RNN is propagated along multiple steps in the sequence, while in the two-layer feedforward network, non-linearity is applied only once to the input. It is not immediately clear on how to “untangle” all these non-linearities and obtain guaranteed estimates of the network weights. (ii) Learning bidirectional RNNs is even more challenging since recursive non-linearities are applied in both the directions, and (iii) Assumption of i.i.d. input and output training samples is no longer applicable, and analyzing concentration bounds for samples generated from a RNN with non-linear state evolution is challenging. We address all these challenges concretely in this paper.

1.1 Summary of Results

Our main contributions are: (i) novel approaches for training input-output RNN and bidirectional RNN models using tensor decomposition methods, (ii) guaranteed recovery of network parameters with polynomial computational and sample complexity, and (iii) transparent conditions for successful recovery based on non-degeneracy of model parameters and bounded evolution of hidden states.

Score function transformations: Training input-output neural networks under arbitrary input is computationally hard. On the other hand, we show that training becomes tractable through spectral methods when the input is generated from a probabilistic model on a continuous state space. This paper can be considered as study of what it takes to uncover the nonlinear dynamics in the system. Since learning under arbitrary input is extremely challenging, we seek to discover under what functions/information of the input the problem becomes tractable. Although this differs from usual approach in training IO-RNNs, this is a promising first step towards demystifying these widely-used models. We show that with some knowledge of the input distribution, we can solve the extremely hard problem of training nonlinear IORNNs. We assume knowledge of score function forms, which correspond to normalized derivatives of the input probability density function (p.d.f). For instance, if the input is standard Gaussian, score functions are given by the Hermite polynomials. There are many unsupervised approaches for estimating the score function efficiently, see Appendix E.1. To estimate the score function, one does not need to estimate the density, and this distinction is especially crucial for models where the normalizing constant or the partition function is intractable to compute. Guarantees have been derived for estimating score functions of many flexible model classes such as infinite dimensional exponential families [26]. In addition, in many settings, we have control over designing the input distribution and our method is directly applicable.

We assume a Markovian model for the input sequence on a continuous state space. For a Markovian model, the score function only depends on the Markov kernel, and has a compact representation, as seen in Section 3.2. The method readily leads to higher order Markov chains. In the main paper we discuss the first order Markov chain for notation simplicity and discuss the extension in Appendix D.3.

Tensor decomposition: We form cross-moments between the output label and score functions of the input. For a vector input, the first order score function is a vector, second order is a matrix, and higher orders corresponds to tensors. Hence, the empirical moments are tensors, and we perform a CP tensor decomposition to obtain the rank-1 components. Efficient algorithms for tensor decomposition have been proposed before, based on simple iterative updates such as tensor power method [4]. After some simple manipulations on the components, we provide estimates of the network parameters of RNN models. The overall algorithm involves simple linear and multilinear steps, is embarrassingly parallel, and is practical to implement [29].

Recovery guarantees: We guarantee consistent recovery under (a low order) polynomial and computational complexity. We consider the realizable setting when samples are generated by a IO-RNN or a BRNN under the following transparent conditions: (i) one hidden layer of neurons with a polynomial activation function, (ii) Markovian input sequence, (iii) full rank weight matrices on input, hidden and output layers, and (iv) spectral norm bounds on the weight matrices to ensure bounded state evolution.

Currently, the question of approximation bounds by a RNN with a fixed number of neurons is not satisfactorily resolved [11] and it is valid to first consider the realizeable setting for this complex problem. The polynomial activations are a departure from the usual sigmoidal units, but they can also capture non-linear signal evolution, and have been employed in different applications, e.g., [10], [30]. The Markovian assumption on the input limits the extent of dependence and allows us to derive concentration bounds for our empirical moments. The full rank conditions on the weight matrices imply non-degeneracy in the neural representation: the weights for any two neurons cannot linearly combine to generate the weight of another neuron. Such conditions have previously been derived for spectral learning of HMMs and other latent variable models [4]. Moreover, it can be easily relaxed by considering higher order moment tensors, and is relevant when we want to have more neurons than the input dimension in our network. The rank assumption on the output weight matrix implies a vector output of sufficient dimensions, i.e., sufficient number of output classes. This can be relaxed to a scalar output, the details are given in Appendix E.2.

The spectral norm condition on the weight matrices arises in the analysis of concentration bounds for the empirical moments. Since we assume polynomial state evolution, it is important to ensure bounded values of the hidden states, and this entails a bound on the spectral norm of the weight matrices. We employ concentration bounds for functions of Markovian input from [19, 18] and combine it with matrix Azuma’s inequality [28] to obtain concentration of empirical moment tensors. This implies learning RNNs with polynomial sample complexity.

Related work: The following works are directly relevant to this paper. (a) Spectral approaches for sequence learning: Previous guaranteed approaches for sequence learning mostly focus on the class of hidden Markov models (HMM). Anandkumar et al. [4] provide a tensor decomposition method for learning the parameters under non-degeneracy conditions, similar to ours. This framework is extended to more general HMMs in [12]. While in a HMM, the relationship between the hidden and observed variables can be modeled as a linear one, in a RNN it is non-linear. However, in a IO-RNN, we have both inputs and outputs, and that is helpful in handling the non-linearities. (b) Input-output sequence models: A rich set of models based on RNNs have been employed in practice in a wide range of applications. Lipton et al. [20] provides a nice overview of these various models. Balduzzi and Ghifari [8] recently apply physics based principles to design RNNs for stabilizing gradients and getting better training error. However, a rigorous analysis of these techniques is lacking.

2 Preliminaries

Let [n] := {1, 2, . . ., n}, and denote the inner product of vectors u and v. For sequence of n vectors , we use the notation z[n] to denote the whole sequence. For vector refers to elementwise power of v. For matrix , the j-th column is referred by or ], the row is referred by or ]. Throughout this paper, denotes the order derivative operator w.r.t. variable x.

Tensor: A real is a member of the outer product of Euclidean spaces . The different dimensions of the tensor are referred to as modes.

) means that is a tensor of order l that is made by reshaping tensor such that the first mode of includes modes of that are shown in , the second mode of includes modes of that are shown in and so on. For example if is a tensor of order 5, [1 2], 3, [4 5]) is a third order tensor, where its first mode is made by concatenation of modes 1, 2 of and so on.

Tensor rank: A 3rd order tensor is said to be rank-1 if it can be written in the form where represents the outer product, and are unit vectors. A tensor is said to have a CP (Candecomp/Parafac) rank k if it can be (minimally) written as the sum of k rank-1 tensors

Note that , where v is repeated p times.

Definition 1 (Row-wise Kronecker product) Row-wise Kronecker product

Let be rows of A, B respectively. Rows of are of the form . Note that our definition is different from usual definition of Khatri-Rao product which is a column-wise Kronecker product.

Derivative: For function g(x) : with vector input , the m-th order derivative w.r.t. variable x is denoted by (which is a m-th order tensor) such that

Tensor as multilinear form: We view a tensor as a multilinear form. Consider matrices . Then tensor is defined as

In particular, for vectors , we have

which is a multilinear combination of the tensor mode-1 fibers. Similarly is a multilinear combination of the tensor entries, and is a linear combination of the tensor slices.

2.1 Problem Formulation

We consider a two-layer input-output RNN, that includes both regression and classification settings:

where poly) denotes an element-wise polynomial of order l, The input sequence x consists of the vectors and hence and . We can learn the parameters of the model using our method. Our algorithm is called GLOREE (Guaranteed Learning Of Recurrent nEural nEtworks) and is shown in Algorithm 1.

Throughout the paper, we assume that the p.d.f. of the input sequence vanishes in the boundary (i.e., when any coordinate of the input goes to infinity). This is also the assumption in [15]. We consider the case where input is a geometrically ergodic Markov chain. Then in order to have mixing and assure ergodicity for the output, we need to impose additional constraints on the model.

2.2 Review of Score functions

As mentioned in the introduction, our method builds on the method introduced by Janzamin et al. [15] called FEAST (Feature ExtrAction using Score function Tensors). The goal of FEAST is to extract discriminative directions using the cross-moment between the label and score function of input. Score function is the normalized (higher order) derivative of p.d.f. of the input.

Let p(x) denote the joint probability density function of random vector . Janzamin et al. [15] denote ) as the order score function, given by

where denotes the order derivative operator w.r.t. variable x. It can also be derived using the recursive form

The importance of score function is that it provides a derivative operator. Janzamin et al. [15] proved that the cross-moment between the label and the score function of the input yields the information regarding derivative of the label w.r.t. the input.

Theorem 1 (Yielding differential operators [15]) Let be a random vector with joint density function p(x). Suppose the order score function ) defined in (1) exists. Consider any order m continuously differentiable tensor function G(x) : . Then, under some mild regularity conditions, we have

3 Extension of score function to input sequences

3.1 Score function form for RNN

We now extend the notion of score function to handle sequence data with non i.i.d. samples. We denote the score function at each time step t in the sequence as ), where , and it is defined below. Theorem 1 can be readily modified to

Lemma 2 (Score function form for input sequence) For vector sequence , let ) and ]) respectively denote the joint density function and the corresponding order score function. Then, under some mild regularity conditions, for all continuously differentiable functions ), we have

3.2 Score function form for Markov chains

We assume a Markovian model for the input sequence, and derive compact score function forms for (3). Note that this form can be readily expanded to higher-order Markov chains.

Lemma 3 (Score function for first-order Markov Chains) Let the input sequence be a first-order Markov chain. The score function in (3) simplifies as

The proof follows the definition of first-order Markov chain and Equation (3).

4 Algorithm and Guarantees

In this paper, we have functions which map input sequence to an output sequence . By assuming a structured form of function mapping in terms of IO-RNN, we can hope to recover the function parameters efficiently. We exploit the score function forms derived above to compute partial derivatives of the output sequence.We first start with some simple intuitions.

4.1 Preliminary insights

Generalized linear model: Before considering the RNN, consider a simple generalized linear model with i.i.d. samples: ), where is the weight matrix and ) is element-wise activation. Here, the partial derivative of E[y|x] w.r.t. x has a linear relationship with the weight matrices and , i.e.,

The first partial derivative is obtained by forming the cross moment )], as given by Theorem 1. The form in (5a) yields and up to a linear transformation. But, by computing second order derivatives, we can recover and , up to scaling of their rows. The second order derivative has the form in (5b). The tensor decomposition in [4] uniquely recovers up to scaling of rows, under full row rank assumptions.

Recovering input-output weight matrices in IO-RNN: The above intuition for GLM readily carries over to IO-RNN. Recall that the IO-RNN has the form

where polydenotes any polynomial function of degree at most l.

Suppose we have access to partial derivatives ]] and ]], then they have the same forms as (5a) and (5b).This is because does not depend on given . Thus, the weight matrices and can be easily recovered by forming )], as given by (3), and it has a compact form for Markovian input in (4). Note that this intuition holds for any non-linear element-wise activation function, and we do not require it to be a polynomial at this stage.

Recovering hidden state transition matrix in IO-RNN: Recovering the transition matrix U is much more challenging as we do not have access to hidden state sequence h[n]. Thus, we cannot readily form partial derivatives of the form ]. Also, the non-linearities get recursively propagated along the chain. Here, we provide an algorithm that works for any polynomial activation function of fixed degree l.

The main idea is that we attempt to recover U by considering partial derivatives ], i.e., how the previous input affects current output . At first glance, this appears complicated and indeed, the various terms are highly coupled and we do not have a nice CP tensor decomposition form. However, we can prove that when the derivative order m is sufficiently large, a nice CP tensor form emerges out, and this m depends on the degree l of the polynomial activation.

For simplicity, we provide intuitions for the quadratic activation function (is a degree-4 polynomial function of , since the activation function is applied twice. By considering fourth order derivative ], many coupled terms vanish since they correspond to polynomial functions of degree less than 4. The surviving term has a nice CP tensor form, and we can recover U efficiently from it. Note that this fourth order partial derivative can be computed efficiently using fourth order score function and forming the cross-moment 1)]. The precise algorithm is given in Algorithm 1.

4.2 Training IO-RNNs

We now provide our algorithm for training IO-RNNs. In this paper we consider vector output and polynomial activation functions of any order 2. For simplicity of notation, we first discuss the case for quadratic activation function. Our algorithm is called GLOREE (Guaranteed Learning Of Recurrent nEural nEtworks) and is shown in Algorithm 1 for quadratic activation function. The general algorithm and analysis for 2 is shown in Appendix C.1. For completeness, we handle the linear case in Appendix E.3. We also cover the case where output y is a scalar (e.g. a binary label) in Appendix E.2.

Now, we consider an RNN with quadratic activation function and vector output. Let

where and . We can learn the parameters of the model using GLOREE. Let n be the window size for RNN.

Theorem 4 (Learning parameters of RNN for quadratic activation function) Let R be the model describing the IO-RNN as in (6). Assuming that are full column rank, we can recover the parameters of R using Algorithm 1 (GLOREE).

Proof Sketch: We have the following properties for an IO-RNN:

Hence, we can recover via tensor decomposition, assuming that they are full row rank. In order to learn U we form the tensor )], and under quadratic activations, we have

Hence, we can recover ) via tensor decomposition. Since is previously recovered using (7), and (exists due to full rank assumption, we can recover U. Thus, Algorithm 1 (GLOREE) consistently recovers the parameters of IO-RNN under quadratic activations. For proof, see Appendix B.1.

4.3 Training Bidirectional RNNs

Bidirectional Recurrent Neural network was first proposed by Schuster and Paliwal [22]. Here there are two groups of hidden neurons.The first group receives recurrent connections from previous time steps while the other from the next time steps. The following equations describe a BRNN

where and denote the neurons that receive forward and backward connections, respectively. Note that BRNNs cannot be used in online settings as they require knowledge of the future steps. However, in various natural language processing applications such as part of speech BRNNs are effective models since they consider both past and future words in a sentence.

We can learn the parameters of bidirectional RNN by modifying our earlier algorithm. For notation simplicity, Algorithm 2 shows the case for quadratic activation functions ) and ). The general polynomial function is considered in Algorithm 4 in Appendix C.2.

Let us provide some intuitions. If we only had forward or backward connections, we would directly apply our previous method in GLOREE. For backward connections, the only difference would be to use derivatives of ] w.r.t. to learn the transition matrix V . Now that we have both hidden neurons mixing to yield the output vector , the cross-moment tensor )] has a CP decomposition where

the factor matrix for the first mode is , i.e., the tensor has a specific form:

corresponds to the tensor incorporating columns of and incorporates columns of . Hence, under full rank assumption, as before, we can recover . Next, we can invert to recover and . We decompose them to recover and . The steps for recovering U and V remains the same as before in GLOREE. The only difference is to use derivatives of ] w.r.t. to learn V .

Theorem 5 (Training BRNN) Let B be the BRNN model in (8). Assuming that are full column rank, we can recover the parameters of B using Algorithm 2. For proof, see Appendix B.2.

4.4 Analysis of GLOREE

In order to analyze the sample complexity, we first need to prove concentration bound for the cross-moment tensor, then we use analysis of the tensor decomposition to show the that sample complexity is a low order polynomial of corresponding parameters.

1. Bounded hidden variables : since we assume a polynomial activation function, the hidden state can grow in an unbounded manner. To avoid this we need the following:

(a) Without loss of generality, we assume that the input sequence is bounded by 1 with high probability, ]

2. Concentration of moment tensor: Deriving concentration bound for the cross-moment tensor is a bit more involved since we have a non-i.i.d. sequence.

Let G be the geometric ergodicity and is the concentration coefficient of the input Markov chain (see Appendix D for definition). We have that:

Theorem 6 (Sample Complexity for GLOREE) Assume the conditions above are met. Suppose the sample complexity n is

Proof Sketch: The proof has two main parts. First, we need to prove a concentration bound for the moment tensor. Second, we can readily use the analysis of tensor decomposition from earlier works such as [4, 14] to compute the sample complexity for this moment tensor. Since the first part is the contribution of this paper, here we focus on that.

In order to prove the concentration bound for the moment tensor, note that our input sequence x[n] is a geometrically ergodic Markov chain. We can think of the empirical moment ˆ)] as functions over the samples x[n] of the Markov chain. Note that this assumes h[n] and y[n] as deterministic functions of x[n], and our analysis can be extended when there is additional randomness. Kontorovich and Weiss [18] provide the result for scalar functions and this is an extension of that result to matrix-valued functions. We use Assumptions 1(a)-(c) to ensure a bounded hidden variable. Next, by leveraging Assumptions 2(a)-2(e) we prove that the cross-moment tensor satisfies Lipschitz property, which paves the way for proving the concentration bound. For details, see Appendix D.

The computational complexity of our method is related to the complexity of the tensor decomposition methods. See [4, 14] for a detailed discussion. It is popular to perform the tensor decomposition in a stochastic manner by splitting the data into mini-batches and reducing computational complexity. Starting with the first mini-batch, we perform a small number of tensor power iterations, and then use the result as initialization for the next mini-batch, and so on. We assume that score function is given to us in an efficient form. Note that if we can write the cross-moment tensor in terms of summation of rank-1 components, we do not need to form the whole tensor explicitly. As an example, if input follows Gaussian distribution, the score function has a simple polynomial form, and the computational complexity of tensor decomposition is ), where n is the number of samples and R is the number of initializations for the tensor decomposition. Similar argument follows when the input is mixture of Gaussian distributions.

5 Conclusion

This work is a first step towards answering challenging questions in sequence modeling. We propose the first method that can recover parameters of IO-RNN as well as BRNN with guarantees. Many of the assumptions can be relaxed, e.g., here we assumed IO-RNNs with aligned inputs and outputs. We can relax this assumption to obtain more general RNNs.

This paper opens up a new horizon for future research, such as extending this framework to HMMs and general settings and analysis under non-stationary inputsWe have assumed the realizable setting where samples are generated from a RNN. The question of approximation bounds by a RNN with a fixed number of neurons is an interesting problem.

Acknowledgment

The authors thank Majid Janzamin for discussions on sample complexity and constructive comments on the draft. We thank Ashish Sabharwal for editorial comments on the draft. This work was done during the time H. Sedghi was a visiting researcher at University of California, Irvine and was supported by NSF Career award FG15890. A. Anandkumar is supported in part by Microsoft Faculty Fellowship, NSF Career award CCF-1254106, ONR award N00014-14-1-0665, ARO YIP award W911NF-13-1-0084, and AFOSR YIP award FA9550-15-1-0221.

References

[1] Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data generating distribution. arXiv preprint arXiv:1211.4246, 2012.

[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. J. of Machine Learning Research, 15:2773–2832, 2014.

[3] Anima Anandkumar, Yi-kai Liu, Daniel J Hsu, Dean P Foster, and Sham M Kakade. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems, pages 917–925, 2012.

[4] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. Tensor decom- positions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.

[5] Anima Anandkumar, Rong Ge, and Majid Janzamin. Sample Complexity Analysis for Learning Over- complete Latent Variable Models through Tensor Methods. arXiv preprint arXiv:1408.0553, Aug. 2014.

[6] Anima Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed Non-Orthogonal Tensor Decomposi- tion via Alternating Rank-1 Updates. arXiv preprint arXiv:1402.5180, Feb. 2014.

[7] Kamyar Azizzadenesheli, Alessandro Lazaric, and Anima Anandkumar. Reinforcement learning of pomdp’s using spectral methods. arXiv preprint arXiv:1602.07764, 2016.

[8] David Balduzzi and Muhammad Ghifari. Strongly-typed recurrent neural networks. arXiv preprint arXiv:1602.02218, 2016.

[9] Asa Ben-Hur and Douglas Brutlag. Sequence motifs: highly predictive features of protein function. In Feature Extraction, pages 625–645. Springer, 2006.

[10] D. Chen and C. Manning. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750, 2014.

[11] Barbara Hammer. On the approximation capability of recurrent neural networks. Neurocomputing, 31 (1):107–123, 2000.

[12] Qingqing Huang, Rong Ge, Sham Kakade, and Munther Dahleh. Minimal realization problem for hidden markov models. In Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on, pages 4–11. IEEE, 2014.

[13] Aapo Hyv¨arinen. Estimation of non-normalized statistical models by score matching. In Journal of Machine Learning Research, pages 695–709, 2005.

[14] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods. Preprint available on arXiv:1506.08473, June 2015.

[15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Score Function Features for Discriminative Learning: Matrix and Tensor Frameworks. arXiv preprint arXiv:1412.2863, Dec. 2014.

[16] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei- Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.

[17] George Konidaris and Finale Doshi-Velez. Hidden parameter markov decision processes: An emerging paradigm for modeling families of related tasks. In 2014 AAAI Fall Symposium Series, 2014.

[18] Aryeh Kontorovich and Roi Weiss. Uniform chernoff and dvoretzky-kiefer-wolfowitz-type inequalities for markov chains and related processes. Journal of Applied Probability, 51(4):1100–1113, 2014.

[19] Leonid Aryeh Kontorovich, Kavita Ramanan, et al. Concentration inequalities for dependent random variables via the martingale method. The Annals of Probability, 36(6):2126–2158, 2008.

[20] Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.

[21] C. Manning and H. Sch¨utze. Foundations of statistical natural language processing, volume 999. MIT Press, 1999.

[22] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681, 1997.

[23] Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse connectivity. NIPS workshop on Deep Learning and Representation Learning, Dec. 2014.

[24] L. Song, A. Anandkumar, B. Dai, and B. Xie. Nonparametric estimation of multi-view latent variable models. arXiv preprint arXiv:1311.3287, Nov. 2013.

[25] D. Spielman, H. Wang, and J. Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, 2012.

[26] Bharath Sriperumbudur, Kenji Fukumizu, Revant Kumar, Arthur Gretton, and Aapo Hyv¨arinen. Den- sity estimation in infinite dimensional exponential families. arXiv preprint arXiv:1312.3516, 2013.

[27] Kevin Swersky, David Buchman, Nando D Freitas, Benjamin M Marlin, et al. On autoencoders and score matching for energy based models. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1201–1208, 2011.

[28] J. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

[29] Yining Wang, Hsiao-Yu Tung, Alexander Smola, and Anima Anandkumar. Fast and guaranteed tensor decomposition via sketching. In Proc. of NIPS, 2015.

[30] Qizhe Xie, Kai Sun, Su Zhu, Lu Chen, and Kai Yu. Recurrent polynomial network for dialogue state tracking with mismatched semantic parsers. In 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, page 295, 2015.

A Notation

For completeness, all the notation required in the paper and Appendices are gathered here as well.

Let [n] := {1, 2, . . ., n}, and denote the or Euclidean norm of vector u, and denote the inner product of vectors u and v. For sequence of n vectors , we use the notation z[n] to denote the whole sequence. For vector refers to elementwise power of v. For matrix , the j-th column is referred by or ], the row is referred by or ] and denotes the spectral norm of matrix C. Throughout this paper, denotes the order derivative operator w.r.t. variable x.

Tensor: A real is a member of the outer product of Euclidean spaces . The different dimensions of the tensor are referred to as modes. For instance, for a matrix, the first mode refers to columns and the second mode refers to rows.

Tensor matricization: For a third order tensor , the matricized version along first mode denoted by is defined such that

and we use Mat to show matricization, i.e., M = (Mat)(T )

) means that is a tensor of order l that is made by reshaping tensor such that the first mode of includes modes of that are shown in , the second mode of includes modes of that are shown in and so on. For example if is a tensor of order 5, [1 2], 3, [4 5]) is a third order tensor, where its first mode is made by concatenation of modes 1, 2 of and so on.

Tensor rank: A 3rd order tensor is said to be rank-1 if it can be written in the form

where represents the outer product, and are unit vectors. A tensor is said to have a CP (Candecomp/Parafac) rank k if it can be (minimally) written as the sum of k rank-1 tensors

Note that , where v is repeated p times.

Definition 2 (Row-wise Kronecker product) Row-wise Kronecker product

where are rows of A, B respectively. Note that our definition is different from usual definition of Khatri-Rao product which is a column-wise Kronecker product (is performed on columns of matrices).

Tensor as multilinear form: We view a tensor as a multilinear form. Consider matrices . Then tensor is defined as

In particular, for vectors , we have

which is a multilinear combination of the tensor mode-1 fibers. Similarly is a multilinear combination of the tensor entries, and is a linear combination of the tensor slices.

Derivative: For function g(x) : with vector input , the m-th order derivative w.r.t. variable x is denoted by (which is a m-th order tensor) such that

When it is clear from the context, we drop the subscript x and write the derivative as ).

Derivative of product of two functions We frequently use the following gradient rule.

Lemma 7 (Product rule for gradient [15]) For tensor-valued functions F(x) : ) :

where the notation denotes permutation of modes of the tensor for permutation vector + 2+ 3+ 1+ 1]. This means that the (+ 1)mode is moved to the last mode.

B Proof of Theorems 4, 5

B.1 Proof Theorem 4

The underlying idea behind the proof comes from Theorem 1. By Theorem 1 we have that

In order to show the derivative form more easily, let us look at derivative of each entry ] of the vector .

and hence the form follows. By Theorem 1 we have that

In order to show the derivative form more easily, let us look at each entry ] of the vector .

The form follows directly using the derivative rule as in Lemma 7.

Now when we reshape the above tensor T to Reshape([1 2], [3 4]]) we have the form

Remark: Difference from IID case: Note that Lemma 2 is different from Theorem 1. Lemma 2 is specific to RNNs while Theorem 1 is from [Janzamin et al 2014] for IID samples. The score function in the two results are different; Note that in Lemma 2 score function ) in Equation (3) is defined as partial derivative (of order m) of joint pdf ) w.r.t. which is a m-th order tensor. When this form is utilized in Steins form in Lemma 2, we obtain partial derivatives of w.r.t. which is a function of in expectation. Note that the expectation is w.r.t. all variables and thus, the dependence to is also averaged out since it is a function of . To provide more steps: let ]. Using law of total expectation and Theorem 1, we have

Since ) where is only a function of , the result follows. Again note that the partial derivative on the RHS is only w.r.t. , while the expectation is for all . The crucial point is G is a function of and not just . It is the use of partial derivatives that allows us to carry out this operation. This is also a novel contribution of this paper and does not follow directly from score function result in [Janzamin et al 2014].

B.2 Proof of Theorem 5

Hence,

The second Equation is direct result of Lemma 4. Therefore, if we decompose the above tensor, the first mode yields the matrix . Next we remove the effect of by multiplying its inverse to the first mode of the moment tensor T . By the above Equations, we readily see that

This means T ((, where

recovers . Recovery of U, V directly follows Lemma 8.

C GLOREE for General Polynomial Activation Functions

C.1 GLOREE for IO-RNN with polynomial activation functions

Here, we consider an RNN with polynomial activation function of order 2, i.e.,

We have the following properties.

Theorem 8 (Learning parameters of RNN for general polynomial activation function) The following is true:

Hence, we can recover via tensor decomposition assuming that they are full row rank. In order to learn U we form the tensor )]. Then we have

Hence, we can recover ) via tensor decomposition under full row rank assumption. Since is previ- ously recovered, U can be recovered. Thus, Algorithm 3(GLOREE) consistently recovers the parameters of IO-RNN with polynomial activations.

Remark on form of the cross-moment tensor: The cross-moment tensor in Equation (17), is a tensor of order l + 1, where modes 2, . . . , l + 1 are similar, i.e., they all correspond to rows of the matrix ), where has gone through row-wise Kronecker product l times . This is a direct extension of the form in Theorem 4 from 2.

For the cross-moment tensor in (8), the coefficients are the expected values of derivatives of activation function. More concretely, if activation is a polynomial of degree l, we have that poly, where polydenotes a polynomial of degree 2. Similarly, the coefficients of the tensor decomposition in (17) correspond to expectations over derivatives of (recursive) activation functions. We assume that these coefficients are non-zero in order to recover the weight matrices.

Remark on tensor decomposition via sketching: Consider line 10 in Algorithm 3 and line 7 in Algorithm 4. Here we are decomposing a tensor of order l + 1. In order to perform this with efficient computational complexity, we can use tensor sketching proposed by Wang et al. [29]. They do not form the moment tensor explicitly and directly compute tensor sketches from data. This avoids the exponential blowup in computation, i.e., it reduces the computational complexity from to (m + m log m)n, where m is the sketch length and n denotes the number of samples. As expected, there is a trade off between the sketch length and the error in recovering the tensor components. For details, see [29].

Proof: By Theorem 1, we have that

The form follows directly using the derivative rule as in Lemma 7.

Thus, we provide an efficient framework for recovering all the weight matrices of an input-output recurrent neural network using tensor decomposition methods.

C.2 GLOREE-B for BRNN with general polynomial activation function

In Algorithm 4, we show the complete algorithm for training BRNN when the activation functions are polynomials of order l. The analysis directly follows from analysis of BRNNs with quadratic activation functions and the extension to 2 is similar to extension of IO-RNNs with quadratic activation functions to general polynomials of order 2.

D Sample complexity analysis Proofs

In this Section, we provide the proofs corresponding to sample complexity analysis of Section 4.4 in the paper. We also elaborate on the Assumptions 2(d) and 2(e) and discuss that in fact we need weaker assumptions.

We now analyze the sample complexity for GLOREE. We first start with the concentration bound for the moment tensor and then use analysis of tensor decomposition to show that our method has a sample complexity that is a polynomial of the model parameters.

Concentration bounds for functions over Markov chains: Our input sequence x[n] is a geometrically ergodic Markov chain. We can think of the empirical moment ˆ)] as functions over the samples x[n] of the Markov chain. Note that this assumes h[n] and y[n] as deterministic functions of x[n], and our analysis can be extended when there is additional randomness. Kontorovich and Weiss [18] provide the result for scalar functions and this is an extension of that result to matrix-valued functions.

We now recap concentration bounds for general functions on Markov chains. For any ergodic Markov chain with the stationary distribution , denote ) as state distribution given initial state . The inverse mixing time is defined as follows

where 1 is geometric ergodicity and 0 1 is the contraction coefficient of the Markov chain.

In the IO-RNN (and BRNN) model, the output is a nonlinear function of the input. Hence, the next step is to deal with this non-linearity. Kontorovich and Weiss [18] analyze the mixing of a (scalar) nonlinear function through its Lipschitz property. In order to analyze how the empirical moment tensor concentrates, we define the Lipschitz constant for matrix valued functions.

Definition 3 (Lipschitz constant for a matrix-valued function of a sequence) A matrix-valued function Φ :

where ] are any two possible sequences of observations. Here denotes the spectral norm and is the state space for a sequence of n observations x[n].

Concentration of empirical moments of IO-RNN and BRNN: In order to ensure that the empirical moment tensor has a bounded Lipschitz constant, we need Assumptions 1(a)-1(b) and 2(a)-2(e). Then, we have that

Lemma 9 (Lipschitz property for the Empirical Moment Tensor) For the IO-RNN discussed in (15), if the Assumptions 1(a)-1(b), 2(a)-2(e) hold, the matricized tensor Matis a function of the input sequence with Lipschitz constant

For proof, see Appendix D.1. Given this Lipschitz constant, we can now apply the following concentration bound. Now that we have proved the Lipschitz property for the cross-moment tensor, we can prove the concen-

tration bound for the IO-RNN.

Theorem 10 (Concentration bound for RNN) For the IO-RNN discussed in (15), let z[n] be the sequence of matricized empirical moment tensors Matfor ]. Then,

with probability at least 1 , where E(z) is expectation over samples of Markov chain when the initial distribution is the stationary distribution and c is specified in Equation (18).

For proof, see Appendix D.2.

D.1 Proof of lemma 9: Lipschitz property for the Empirical Moment Tensor

In order to prove lemma 9, we first need to show that the matricized cross-moment tensor is a Lipschitz function of the input sequence and find the Lipschitz constant.

We first show that the output function is a Lipschitz function of the input sequence and find the Lipschitz constant. In order to prove this, we need the above assumptions to ensure a bounded hidden state and a bounded output sequence. Then, we have that

Lemma 11 (Lipschitz property for the Output of IO-RNN) For the IO-RNN discussed in (15), if the above assumptions hold, then the output is a Lipschitz function of the input sequence with Lipschitz constant

Proof: This follows directly from the definition. In order to find the Lipschitz constant, we need to sum over all possible changes in the input sequence [19]. Therefore, we bound derivative of the function w.r.t. each input entry and then take the average the results to provide an upper bound on the Lipschitz constant. With the above assumptions, it is straightforward to show that

Taking the average of this geometric series for ] and large sample sequence, we get as the Lipschitz constant. Next we want to find the Lipschitz constant for the matricized tensor )] which is a function of the input sequence. We use the Assumptions 1(a)-1(c) and 2(a)-2(e) from Section 4.4. Considering the rule for derivative of product of two functions as in Lemma 7 we have that

Hence,

the last inequality follows from definition of first order Markov chain. and assuming that is bounded and the each of the above derivatives is bounded by some value . and we conclude that Mat ()]) is Lipschitz with Lipschitz constant

Now that we have proved the Lipschitz property for the cross-moment tensor, we can prove the concentration bound for the IO-RNN.

D.2 Proof of Theorem 10

In order to get the complete concentration bound in Theorem 10, we need Lemma 9 in addition to the following Theorem.

Theorem 12 (Concentration bound for a matrix-valued function of a Markov chain] Consider a Markov chain with observation samples , geometric ergodicity G, contraction coefficient and an arbitrary initial distribution. For any c-Lipschitz matrix-valued function Φ() : , we have

with probability at least 1 , where E(Φ) is expectation over samples of Markov chain when the initial distribution is the stationary distribution.

Proof: The proof follows result of [19], [17], and Matrix Azuma theorem (which can be proved using the analysis of [28] for sum of random matrices). The upper bound can be decoupled into two parts, (1) [Φ]where the expectation is over the same initial distribution as used for Φ and (2) the difference between E[Φ] for the case where the initial distribution is the same initial distribution as used for Φ and the initial distribution being equal to the stationary distribution. It is direct from analysis of [18] that the latter is upper bounded by . The former can be bounded by Theorem 13 below and hence Theorem 12 follows.

Theorem 13 (Matrix Azuma [7]) Consider Hidden Markov Model with finite sequence of n samples as observations given arbitrary initial states distribution and c-Lipschitz matrix valued function Φ : , then

Proof: This proof is from [7] and is repeated here for completeness. Theorem 7.1 [28] provides the upper confidence bound for summation of matrix random variables. Consider a finite sequence of matrices Ψ. The variance parameter is the upper bound for [Ψand we have that

with probability at least 1. For function Φ, we define the martingale difference of function Φ as the input random variable with arbitrary initial distribution over states.

where is the subset of samples from i-th position in sequence to j-th one. Hence, the summation over these set of random variables gives ] ) [Φ], E[Φ] is the expectation with the same initial state distribution.

Then it remains to find which is the upper bound for (Φ; for all possible sequences. Define MD

Considering the analysis of tensor decomposition analysis in [14], Theorem 10 implies polynomial sample complexity for GLOREE. The sample complexity is a polynomial of (, Detailed proof is similar to analysis in [14], [7]. Note that with similar analysis we can prove polynomial sample complexity for GLOREE-B.

D.3 Remark on Assumptions 2(d), 2(e)

As you saw in the proof of lemma 9, in order to prove the Lipschitz property for moment tensor, we need the following:

By result of lemma 11, in order to prove the O(1/n) for the first term we need to be a bounded value.

As we showed in Appendix D.1, Assumptions 2(d), 2(e) suffice to prove the bound for the second term too. However, that is not a necessary assumption. We provide an example here.

Example: Assumptions (2d)-(2e) can be replaced by the following:

(2e*) For those values of i such that ] is nonzero, it is bounded by some constant which is specified by pdf of the input.

It can be readily seen from the proof in Appendix D.1 that if we replace Assumptions 2(d), 2(e), we can still prove the result.

In this case, the 3in bound for c will be replaced by . Note that first order Markov chain is a special case of this example.

E Discussion

E.1 Score Function Estimation

According to [15], there are various efficient methods for estimating the score function. The framework of score matching is popular for parameter estimation in probabilistic models [13, 27], where the criterion is to fit parameters based on matching the data score function. Swersky et al. [27] analyze the score matching for latent energy-based models. In deep learning, the framework of auto-encoders attempts to find encoding and decoding functions which minimize the reconstruction error under added noise; the so-called Denoising AutoEncoders (DAE). This is an unsupervised framework involving only unlabeled samples. Alain and Bengio [1] argue that the DAE approximately learns the first order score function of the input, as the noise variance goes to zero. Sriperumbudur et al. [26] propose non-parametric score matching methods that provides the non-parametric score function form for infinite dimensional exponential families with guaranteed convergence rates. Therefore, we can use any of these methods for estimating ]) and use the recursive form [15].

to estimate higher order score functions.

E.2 Training IO-RNN and BRNN with scalar output

In the main text, we discussed training IO-RNNs and BRNNs with vector outputs. Here we expand the results to training IO-RNNs and BRNNs with scalar outputs. Note that in order to recover the parameters uniquely, we need the cross-moment to be a tensor of order at least 3. This is due to the fact that in general matrix decomposition does not provide unique decomposition for non-orthogonal components. In order to obtain a cross-moment tensor of order at least 3, since the output is scalar, we needs its derivative tensors of order at least 3. In order to have a non-vanishing gradient, the activation function needs to be a polynomial of order 3.

Hence, our method can also be used for training IO-RNN and BRNN with scalar output if the activation function is a polynomial of order 3, i.e., Let be the output of

In order to learn U, we form the tensor ˆReshape([1 + 1 . Then we have

where is the row-wise Kronecker product defined in 1. Hence, since we know , we can recover U via tensor decomposition.

We can prove that we can learn the parameters of a BRNN with scalar output and polynomial activation functions of order 3 using the same trend as for Lemma 5.

E.3 Training Linear IO-RNN

In the paper we discussed the problem of training IO-RNNs with polynomial activation function of order 2. Here we propose a method for training IO-RNNs with linear activation functions. Although our proposed methods for two cases differ in nature, we include both of them for completeness and covering all cases.

Sedghi and Anandkumar [23] provide a method to train first layer of feed-forward neural networks using the first-order score function of the input. For a NN with vector output, their formed cross-moment is a matrix of the form where is the weight matrix for the first layer and B is a matrix that includes the rest of the derivative matrix. Then they argie that if is sparse, the problem of recovering is a sparse dictionary learning problem that can be solved efficiently using Sparse Dictionary Learning Algorithm [25].

Here we show that for IO-RNN with linear activation function, we can expand the result of Sedghi and Anandkumar [23] to the non-i.i.d. input sequence.

Let

For our linear model the derivative has a Toeplitz form. Assuming that is sparse, we can use this structure and Sparse Dictionary Learning Algorithm [25] to recover the model parameters. Below we write the cross-moment Topelitz form for n = 4 for simplicity.

If we recover the Toeplitz structure, we have access to the following matrices: . Next, we put these matrices in a new matrix C as below.

It is easy to see that for matrix B as shown below

Now assuming that is sparse and B is full column-rank, we can recover using Sparse Dictionary Learning Algorithm [25].

Let be the singular-value decomposition of ). It is easy to show that, to ensure that B is full column-rank, we need the singular-values of U to satisfy . Once we recover , we can recover and .

F Spectral Decomposition Algorithm

As part of GLOREE, we need a spectral/tensor method to decompose the cross-moment tensor to its rank-1 components. Refer to notation for definition of tensor rank and its rank-1 components. As depicted in notation, we are considering CP decomposition. Note that CP tensor decomposition for various scenarios is extensively analyzed in the literature [24], [2], [5], [6], [15], [14]. We follow the method in [14].

The only difference between our tensor decomposition setting and that of [14] is that they have a symmetric tensor (i.e., ˆ) whereas in GLOREE, we have two asymmetric tensor decomposition procedures in the form of ˆ. Therefore,

1 We first make a symmetric version of our tensor. For our specific case, this includes multiplying the first mode of the tensor with a matrix D, such that ˆ. We use the rule presented in [3] to form the symmetrization tensor. For example for ˆ)], we use ˆ)]].

2 Next, we run the tensor decomposition procedure as in [14] to recover estimates of ˆ]. The steps are shown in Figure 2. For more details, see [14].

3 The last step includes reversing the effect of symmetrization matrix D to recover estimate of ˆ]. For more discussion on symmetrization, see [3].

Our overall algorithm is shown in Algorithm 5.

Remark on tensor decomposition via sketching: Consider line 10 in Algorithm 3 and line 7 in Algorithm 4. Here we are decomposing a tensor of order l + 1. The tensor decomposition algorithm for third order tensor readily generalizes to higher order tensors. In order to perform this with efficient computational complexity, we can use tensor sketching proposed by Wang et al. [29]. They do not form the moment tensor explicitly and directly compute tensor sketches from data. This avoids the exponential blowup in computation, i.e., it reduces the computational complexity from to (m + m log m)n, where m is the sketch length and n denotes the number of samples. As expected, there is a trade off between the sketch length and the error in recovering the tensor components. For details, see [29].

Figure 2: Overview of tensor decomposition algorithm for a symmetric third order tensor [14].

designed for accessibility and to further open science