On the Initialization of Long Short-Term Memory Networks
2019·arXiv
Abstract

Weight initialization is important for faster convergence and stability of deep neural network training. In this paper, a robust initialization method is developed to address the training instability in long short-term memory (LSTM) networks. It is based on a normalized random initialization of the network weights that aims at preserving the variance of the network input and output in the same range. The method is applied to standard LSTMs for univariate time series regression and to LSTMs robust to missing values for multivariate disease progression modeling. The results show that in all cases, the proposed initialization method outperforms the state-of-the-art initialization techniques in terms of training convergence and generalization performance of the obtained solution.

Keywords: Deep neural networks, long short-term memory, time series regression, initialization, disease progression modeling.

Recurrent neural networks (RNNs) are the state-of-the-art nonparametric methods for sequence learning that map an input sequence to an output sequence by predicting the next time steps. RNN training using the backpropagation through time algorithm is challenging due to vanishing and exploding gradients where the norm of the backpropagated error gradient can increase or decrease exponentially, hindering the network in capturing long-term dependencies [1].

Three main solutions have been proposed in the literature to improve RNN training: modifications of the training algorithm, different weight initialization schemes, and modifications of the network architecture. In the first approach, advanced optimization techniques such as the Hessian-Free method [2] or regularized loss functions [3] are applied to improve the backpropagation through time algorithm for learning long sequences. The second approach is to properly initialize the RNN weight matrices, for example, to be identity [4] or orthogonal [5], to mitigate the long-term dependency problem. The third approach is to employ nonlinear reset units in the RNN architecture to store information for a long time, for instance, using long short-term memory (LSTM) networks [6] or gated recurrent units (GRUs) [7].

LSTM networks, the most successful type of RNNs, use a gated architecture to replace the hidden unit with a memory cell to efficiently capture long-term temporal dependencies by storing and retrieving sequence information over time. The memory cell is used as a feedback along with three nonlinear (multiplicative) reset units to keep the backpropagated error signal constant. The input and output gates of the cell learn their weights to incorporate the stored information or to control the output values. There is also a forget gate that learns to remember or forget the memory information over time by scaling the cell content. Therefore, in contrast to vanilla RNNs, LSTM units by design allow gradients to flow unchanged, but they can still suffer from instabilities (exploding gradient problem) when trained on long sequences [8].

In this paper, a simple, yet robust initialization method is proposed to tackle the training instabilities in LSTM networks. The idea is based on normalized random initialization of the network weights with the property that the input and output signals have the same variance. The proposed method is applied to standard LSTMs [1,9] for univariate time series regression using data from the UCR Time Series Archive [10] and to LSTMs robust to missing values [11] for multivariate disease progression modeling in the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort [12] using volumetric magnetic resonance imaging (MRI) measurements.

Since deep neural network training is achieved by solving a nonconvex optimization problem, mostly in a stochastic way, a random weight initialization scheme is important for faster convergence and stability. Otherwise, the magnitudes of the input signal and error gradients at different layers can exponentially decrease or increase, leading to an ill-conditioned problem. Standard initialization of weights with zero-mean uniform/Gaussian distributions and heuristic variances ranging from 0.001 to 0.01 or an input layer size (N) dependent variance of 1/(3N) has been widely used in previous studies [13]. However, studies on the initialization, for instance, using unsupervised pre-training [14], showed its importance as a regularizer for the optimization procedure to robustly reach a local minimum and to improve generalization.

Accordingly, training difficulties have been investigated based on the variance of the responses in each layer, when the singular values of the Jacobian are not unit, and a normalized initialization of uniform weights with a variance of 1/N is suggested assuming that the activation functions are identity and/or hyperbolic tangent [13]. Likewise, a scaled initialization method has been developed to train deep rectified models from scratch using zero-mean Gaussian weights whose variances are 2/N [15].

To resolve the long-term temporal dependencies problem in RNNs, which can be seen as deep networks when unfolded through time, the (scaled) identity matrix has been applied to initialize the hidden (recurrent) weights matrix to output the previous hidden state in the absence of the current inputs in RNNs composed of rectified linear units (ReLU) [4]. Alternatively, (nearly) orthogonal matrices [5] and scaled positive-definite weight matrices [16] have been used to address vanishing and exploding gradients in RNNs by preserving the gradient norm during backpropagation.

As can be seen, different initialization methods have been proposed to deal with the training convergence problem in deep neural networks including RNNs, assuming that LSTMs by design can handle the issue. Hence, the abovementioned initialization methods, e.g., orthogonal recurrent weight matrices and current input weight matrices, both drawn i.i.d. from zero-mean Gaussian distributions with variances of 1/N, have also been applied to LSTMs. However, as noted before, LSTMs can still suffer from instability with improper initialization due to the stochastic nature of the optimization and the use of multiplicative gates and feedback signals.

To address training instability and slow convergence in LSTMs, we propose a scaled random weights initialization method that aims to keep the variance of the network input and output in the same range. Let $x_t^j \in \mathbb{R}^{N \times 1}$ be the j-th observation of an N-dimensional input vector at time t. The feedforward pass of an LSTM network can be expressed as

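$$
\begin{aligned}
f_t^j &= \sigma_g\big(W_f x_t^j + U_f h_{t-1}^j + b_f\big), \\
i_t^j &= \sigma_g\big(W_i x_t^j + U_i h_{t-1}^j + b_i\big), \\
z_t^j &= \sigma_c\big(W_c x_t^j + U_c h_{t-1}^j + b_c\big), \\
c_t^j &= f_t^j \odot c_{t-1}^j + i_t^j \odot z_t^j, \\
o_t^j &= \sigma_g\big(W_o x_t^j + U_o h_{t-1}^j + b_o\big), \\
h_t^j &= o_t^j \odot \sigma_h\big(c_t^j\big),
\end{aligned}
$$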

where $\{f_t^j, i_t^j, z_t^j, c_t^j, o_t^j, h_t^j\} \in \mathbb{R}^{M \times 1}$ are the j-th observation of the forget gate, input gate, modulation gate, cell state, output gate, and hidden output at time t, respectively, and M is the number of output units. Also, $\{W_f, W_i, W_c, W_o\} \in \mathbb{R}^{M \times N}$ are weight matrices containing the connecting weights from the input $x_t^j$ to the gates and cell, $\{U_f, U_i, U_c, U_o\} \in \mathbb{R}^{M \times M}$ are weight matrices containing the connecting weights from the recurrent input $h_{t-1}^j$ to the gates and cell, $\{b_f, b_i, b_c, b_o\} \in \mathbb{R}^{M \times 1}$ denote the corresponding biases of the neurons, and $\odot$ is the Hadamard product. Finally, $\sigma_g$, $\sigma_c$, and $\sigma_h$ are nonlinear activation functions allocated to the gates, input modulation, and hidden output, respectively. Note that, in a regression problem, M = N, and $h_{t-1}^j$ is an estimation of $x_t^j$. The regression assumptions can still be applied to sequence-to-sequence or sequence-to-label learning problems simply by adding a fully-connected layer with N input nodes and a desired number of output units.
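The definitions above can be read as a one-step computation. The following is a minimal sketch in plain Python, assuming the standard (non-peephole) LSTM update with a logistic sigmoid in the gates, a hyperbolic tangent for input modulation, and an identity hidden activation. The helper names (`matvec`, `lstm_step`, `gaussian_matrix`) and the weight variance of 0.1 are illustrative choices, not taken from the paper.

```python
import math
import random

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b,
              sigma_g=sigmoid, sigma_c=math.tanh, sigma_h=lambda v: v):
    """One feedforward step; W, U, b are dicts keyed by 'f', 'i', 'c', 'o'."""
    def pre(k):
        wx, uh = matvec(W[k], x), matvec(U[k], h_prev)
        return [p + q + r for p, q, r in zip(wx, uh, b[k])]
    f = [sigma_g(a) for a in pre("f")]          # forget gate
    i = [sigma_g(a) for a in pre("i")]          # input gate
    z = [sigma_c(a) for a in pre("c")]          # modulation gate
    c = [fm * cm + im * zm                      # cell state (Hadamard products)
         for fm, cm, im, zm in zip(f, c_prev, i, z)]
    o = [sigma_g(a) for a in pre("o")]          # output gate
    h = [om * sigma_h(cm) for om, cm in zip(o, c)]
    return h, c

def gaussian_matrix(rows, cols, variance, rng):
    std = math.sqrt(variance)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
N = M = 3                                        # regression setting: M = N
W = {k: gaussian_matrix(M, N, 0.1, rng) for k in "fico"}
U = {k: gaussian_matrix(M, M, 0.1, rng) for k in "fico"}
b = {k: [0.0] * M for k in "fico"}               # zero biases
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
h, c = lstm_step(x, [0.0] * M, [0.0] * M, W, U, b)
```

Passing `sigma_g=lambda v: v` and `sigma_c=lambda v: v` gives the all-identity case used in the variance derivation that follows.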

Assume that all of the weight matrices are independently initialized with zero-mean i.i.d. random values obtained from a symmetric distribution. The goal is to derive the condition(s) on the initialization of the weights to achieve $\mathrm{Var}(h_t^j) = \mathrm{Var}(x_t^j)$. Since the weights are independent of the input, assuming an exact estimation for the recurrent value, i.e., $h_{t-1}^j = x_t^j$, and mutually independent zero-mean input features sharing the same distribution, the variance of the forget gate can be calculated as

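$$
\mathrm{Var}(f_t^j) = \mathrm{Var}\big(\sigma_g(W_f x_t^j + U_f h_{t-1}^j + b_f)\big) = N\,\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big)\,\mathrm{Var}(x_t^j),
$$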

where $w_f$ and $u_f$ are the elements of $W_f$ and $U_f$, respectively. The bias does not contribute to the variance because it is an independent constant initialized to zero. Moreover, the second equality holds under the assumption that $\sigma_g$ is an identity function. We will discuss other commonly used functions in LSTM units in Section 3.2.

Variance calculations for the input, modulation, and output gates can be performed in a similar way to the forget gate. That is to say,

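$$
\begin{aligned}
\mathrm{Var}(i_t^j) &= N\,\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(z_t^j) &= N\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(o_t^j) &= N\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)\,\mathrm{Var}(x_t^j),
\end{aligned}
$$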

where $w_i$, $u_i$, $w_c$, $u_c$, $w_o$, and $u_o$ are the elements of $W_i$, $U_i$, $W_c$, $U_c$, $W_o$, and $U_o$, respectively.
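The gate variances can be checked numerically. The following is a minimal Monte Carlo sketch in plain Python, assuming an identity gate activation, zero bias, standardized inputs, and the exact-estimation assumption $h_{t-1}^j = x_t^j$. Under these assumptions each gate pre-activation should have variance $N(\mathrm{Var}(w) + \mathrm{Var}(u))\,\mathrm{Var}(x_t^j)$. The function name and all numerical values are illustrative.

```python
import random

def empirical_gate_variance(N, var_w, var_u, n_samples=20000, seed=0):
    """Sample one gate unit with fresh weights and inputs on every draw."""
    rng = random.Random(seed)
    std_w, std_u = var_w ** 0.5, var_u ** 0.5
    samples = []
    for _ in range(n_samples):
        total = 0.0
        for _ in range(N):
            x = rng.gauss(0.0, 1.0)        # standardized input feature
            w = rng.gauss(0.0, std_w)      # input weight
            u = rng.gauss(0.0, std_u)      # recurrent weight
            total += w * x + u * x         # recurrent input replaced by x
        samples.append(total)
    mean = sum(samples) / n_samples
    return sum((s - mean) ** 2 for s in samples) / n_samples

N, var_w, var_u = 4, 0.05, 0.02
predicted = N * (var_w + var_u)            # Var(x) = 1 after standardization
measured = empirical_gate_variance(N, var_w, var_u)
```

The two values should agree up to sampling noise; the weights are resampled on every draw because the derivation treats them as random variables.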

The cell state formula is a form of the stochastic recurrence equation [17], also known as growing perpetuity, in which the moments of the cell state are time varying. Therefore, one tractable way to stabilize the network training is to set $\mathrm{Var}(c_t^j) = \mathrm{Var}(c_{t-1}^j)$. Accordingly,

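$$
\mathrm{Var}(c_t^j) = \mathrm{Var}\big(f_t^j \odot c_{t-1}^j + i_t^j \odot z_t^j\big) = \mathrm{Var}(f_t^j)\,\mathrm{Var}(c_{t-1}^j) + \mathrm{Var}(i_t^j)\,\mathrm{Var}(z_t^j)
\;\;\Rightarrow\;\;
\mathrm{Var}(c_t^j) = \frac{\mathrm{Var}(i_t^j)\,\mathrm{Var}(z_t^j)}{1 - \mathrm{Var}(f_t^j)},
$$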

where the above equation is obtained based on the zero-mean assumption and the independence assumption between all of the gates and the cell state, which avoids terms containing covariance matrices in the last expression. Also, note that $0 < \mathrm{Var}(f_t^j) < 1$.
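To see why the bound on the forget gate variance matters, the variance recursion implied by the cell state formula under the zero-mean and independence assumptions, $\mathrm{Var}(c_t^j) = \mathrm{Var}(f_t^j)\,\mathrm{Var}(c_{t-1}^j) + \mathrm{Var}(i_t^j)\,\mathrm{Var}(z_t^j)$, can be iterated directly. This is a small illustrative sketch; the function name and the variance values are assumptions chosen for the example.

```python
def cell_variance_trajectory(var_f, var_i, var_z, var_c0=0.0, steps=50):
    """Iterate Var(c_t) = Var(f) * Var(c_{t-1}) + Var(i) * Var(z)."""
    v = var_c0
    trajectory = [v]
    for _ in range(steps):
        v = var_f * v + var_i * var_z
        trajectory.append(v)
    return trajectory

# Var(f) < 1: the recursion settles at Var(i) * Var(z) / (1 - Var(f)).
stable = cell_variance_trajectory(var_f=0.3, var_i=0.7, var_z=1.0)
fixed_point = 0.7 * 1.0 / (1.0 - 0.3)

# Var(f) > 1: the cell state variance grows without bound.
unstable = cell_variance_trajectory(var_f=1.2, var_i=0.7, var_z=1.0)
```

With a forget gate variance below one the trajectory converges geometrically to the stationary value, which is the regime the initialization targets.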

Finally, the variance of the network output is computed as

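$$
\mathrm{Var}(h_t^j) = \mathrm{Var}\big(o_t^j \odot \sigma_h(c_t^j)\big) = \mathrm{Var}(o_t^j)\,\mathrm{Var}(c_t^j),
$$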

where the last equality is obtained assuming an identity activation function and independence between the output gate and the cell state. Considering all of the calculated variances and setting $\mathrm{Var}(h_t^j) = \mathrm{Var}(x_t^j) = 1$, the required condition can be summarized as

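$$
1 - N\,\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big) = N^3\,\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big),
$$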

where the right-hand side of the above equation is the product of the variance terms of the weights connected to the input, modulation, and output gates.
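As a worked example, the condition can be solved in closed form once the number of free variances is reduced. The sketch below assumes the condition takes the form $1 - N(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)) = N^3(\mathrm{Var}(w_i) + \mathrm{Var}(u_i))(\mathrm{Var}(w_c) + \mathrm{Var}(u_c))(\mathrm{Var}(w_o) + \mathrm{Var}(u_o))$, which is what the variances above give with identity activations and unit input variance. It further assumes, purely for illustration, that all eight weight variances share one value $s$, so that the condition becomes $8N^3s^3 + 2Ns - 1 = 0$. This equal-variance choice is not one of the paper's configurations.

```python
def equal_variance_solution(N, tol=1e-12):
    """Bisection for s in (0, 1/(2N)) solving 8 N^3 s^3 + 2 N s - 1 = 0."""
    def g(s):
        return 8.0 * N ** 3 * s ** 3 + 2.0 * N * s - 1.0
    lo, hi = 0.0, 1.0 / (2.0 * N)      # g(lo) = -1 < 0 and g(hi) = 1 > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def output_variance(N, s):
    """Var(h) = Var(o) * Var(c) when every weight variance equals s."""
    var_gate = 2.0 * N * s             # Var(f) = Var(i) = Var(z) = Var(o)
    var_c = var_gate * var_gate / (1.0 - var_gate)
    return var_gate * var_c

s_univariate = equal_variance_solution(1)     # N = 1
s_multivariate = equal_variance_solution(6)   # N = 6
```

In both cases `output_variance` evaluates to one up to the bisection tolerance, and the forget gate variance `2 * N * s` stays strictly between zero and one as required.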

Similar to the feedforward pass, some initialization conditions can be derived to ensure that the variance of the backpropagated gradient remains unchanged, i.e., $\mathrm{Var}(\partial L/\partial h_t^j) = \mathrm{Var}(\partial L/\partial x_t^j)$, where $L \in \mathbb{R}$ is the loss function defined based on the actual target and network output. However, as shown in [13] and [15], an initialization that properly scales the forward signal is equivalent to one that properly scales the backward signal, and since the number of units in the input and output of the LSTM network are the same, similar conditions for weight initialization using backpropagation will be obtained.

3.1 Peephole Connections

In general, LSTMs can be extended by connecting their internal cell state to the multiplicative gates using the so-called peephole connections. These cell-to-gate connections allow the gates to inspect the current cell state even if the output gate is closed, and consequently help improve the performance, especially when the task involves a precise duration of intervals [9]. The feedforward pass of the peephole LSTM can be formulated as

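$$
\begin{aligned}
f_t^j &= \sigma_g\big(W_f x_t^j + U_f h_{t-1}^j + V_f c_{t-1}^j + b_f\big), \\
i_t^j &= \sigma_g\big(W_i x_t^j + U_i h_{t-1}^j + V_i c_{t-1}^j + b_i\big), \\
z_t^j &= \sigma_c\big(W_c x_t^j + U_c h_{t-1}^j + b_c\big), \\
c_t^j &= f_t^j \odot c_{t-1}^j + i_t^j \odot z_t^j, \\
o_t^j &= \sigma_g\big(W_o x_t^j + U_o h_{t-1}^j + V_o c_t^j + b_o\big), \\
h_t^j &= o_t^j \odot \sigma_h\big(c_t^j\big),
\end{aligned}
$$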

where $\{V_f, V_i, V_o\} \in \mathbb{R}^{M \times M}$ are diagonal peephole weight matrices. Hence, each gate will only look at its corresponding cell state. To achieve $\mathrm{Var}(h_t^j) = \mathrm{Var}(x_t^j)$, all the assumptions applied to the traditional LSTM are used for the peephole LSTM. Assuming that the peephole matrices are independent of the input and the cell state and are independently initialized with zero-mean i.i.d. random values obtained from a symmetric distribution, the variances can be calculated as

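$$
\begin{aligned}
\mathrm{Var}(f_t^j) &= N\,\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_f)\,\mathrm{Var}(c_{t-1}^j), \\
\mathrm{Var}(i_t^j) &= N\,\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_i)\,\mathrm{Var}(c_{t-1}^j), \\
\mathrm{Var}(z_t^j) &= N\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(o_t^j) &= N\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_o)\,\mathrm{Var}(c_t^j),
\end{aligned}
$$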

where $v_f$, $v_i$, and $v_o$ are the diagonal elements of $V_f$, $V_i$, and $V_o$, respectively. Merging Equations (5) and (7) under the assumption that $\mathrm{Var}(h_t^j) = \mathrm{Var}(x_t^j) = 1$ results in a quadratic equation that can be expressed as

$$
\beta_{21}\,\mathrm{Var}^2(c_t^j) + \beta_{11}\,\mathrm{Var}(c_t^j) + \beta_{01} = 0, \tag{8}
$$

where $\beta_{01} = -1$, $\beta_{11} = N\,(\mathrm{Var}(w_o) + \mathrm{Var}(u_o))$, and $\beta_{21} = \mathrm{Var}(v_o)$. Since the discriminant $\Delta_1 = \beta_{11}^2 - 4\beta_{21}\beta_{01}$ is always positive considering nonzero variances, there are two possible solutions for Equation (8): $\mathrm{Var}(c_t^j) = (-\beta_{11} \pm \sqrt{\Delta_1})/(2\beta_{21})$. However, since $\beta_{21} > 0$ and $\beta_{01} < 0$, with a positive discriminant and based on the sign of the product of the roots ($\beta_{01}/\beta_{21}$), one of the real solutions would be negative, which cannot be accepted as $\mathrm{Var}(c_t^j) > 0$. Therefore, the desired solution to Equation (8) will be obtained as

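$$
\mathrm{Var}(c_t^j) = \frac{-\beta_{11} + \sqrt{\Delta_1}}{2\beta_{21}} = \frac{\sqrt{N^2\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)^2 + 4\,\mathrm{Var}(v_o)} - N\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)}{2\,\mathrm{Var}(v_o)}. \tag{9}
$$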

Likewise, combining Equations (2) to (4) and (6) using the same assumptions leads to another quadratic equation that can be written as

$$
\beta_{22}\,\mathrm{Var}^2(c_t^j) + \beta_{12}\,\mathrm{Var}(c_t^j) + \beta_{02} = 0, \tag{10}
$$

where $\beta_{02} = N^2\,(\mathrm{Var}(w_i) + \mathrm{Var}(u_i))\,(\mathrm{Var}(w_c) + \mathrm{Var}(u_c))$, $\beta_{22} = \mathrm{Var}(v_f)$, and $\beta_{12} = N\,\mathrm{Var}(v_i)\,(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)) + N\,(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)) - 1$. The two possible solutions for Equation (10) will be obtained as $\mathrm{Var}(c_t^j) = (-\beta_{12} \pm \sqrt{\Delta_2})/(2\beta_{22})$, where $\Delta_2 = \beta_{12}^2 - 4\beta_{22}\beta_{02}$ is the discriminant of the equation. Here, since $\beta_{02}, \beta_{22} > 0$, assuming a nonnegative discriminant and based on the sign of the sum and product of the roots ($-\beta_{12}/\beta_{22}$ and $\beta_{02}/\beta_{22}$), both real solutions could be positive and acceptable provided that $\beta_{12} < 0$. However, to achieve a simple solution for initialization, one can set $\Delta_2 = 0$ and $\beta_{12} < 0$, which produces repeated real positive roots for the problem. Therefore, the real solution to Equation (10) can be obtained as

$$
\mathrm{Var}(c_t^j) = -\frac{\beta_{12}}{2\beta_{22}}. \tag{11}
$$

Finally, conditions for the existence of a common solution to Equations (8) and (10) can be obtained using Equations (9) and (11) as follows

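$$
\begin{aligned}
1 - N\,\mathrm{Var}(v_i)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big) - N\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big) &= \sqrt{4N^2\,\mathrm{Var}(v_f)\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)}, \\
\frac{\sqrt{4N^2\,\mathrm{Var}(v_f)\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)}}{2\,\mathrm{Var}(v_f)} &= \frac{\sqrt{N^2\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)^2 + 4\,\mathrm{Var}(v_o)} - N\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)}{2\,\mathrm{Var}(v_o)}.
\end{aligned}
\tag{12}
$$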
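One way to use these conditions in practice is to fix the input and recurrent weight variances together with $\mathrm{Var}(v_o)$, and then solve for the remaining peephole variances. The sketch below does this for the identity-activation case: it takes $\mathrm{Var}(c_t^j)$ as the positive root of Equation (8), then picks $\mathrm{Var}(v_f)$ and $\mathrm{Var}(v_i)$ so that Equation (10) has that value as a repeated root ($\Delta_2 = 0$, $\beta_{12} < 0$). The function names and the example variances are hypothetical and are not the configurations of Table 1.

```python
import math

def var_c_from_output(N, var_wo, var_uo, var_vo):
    """Positive root of Var(vo) v^2 + N (Var(wo) + Var(uo)) v - 1 = 0."""
    b11 = N * (var_wo + var_uo)
    delta1 = b11 * b11 + 4.0 * var_vo          # beta_01 = -1
    return (-b11 + math.sqrt(delta1)) / (2.0 * var_vo)

def complete_peephole_config(N, var_wf, var_uf, var_wi, var_ui,
                             var_wc, var_uc, var_wo, var_uo, var_vo):
    """Return (Var(vf), Var(vi), Var(c)) giving a common repeated root."""
    r = var_c_from_output(N, var_wo, var_uo, var_vo)
    n_z = N * (var_wc + var_uc)                # Var(z) for unit input variance
    b02 = N * (var_wi + var_ui) * n_z
    var_vf = b02 / (r * r)                     # repeated root: r^2 = b02 / b22
    # Repeated root also needs beta_12 = -2 * b02 / r; solve that for Var(vi).
    var_vi = (1.0 - N * (var_wf + var_uf) - 2.0 * b02 / r) / n_z
    if var_vi <= 0.0:
        raise ValueError("no admissible Var(vi) for these variances")
    return var_vf, var_vi, r

var_vf, var_vi, var_c = complete_peephole_config(
    N=1, var_wf=0.1, var_uf=0.1, var_wi=0.05, var_ui=0.05,
    var_wc=0.5, var_uc=0.5, var_wo=0.25, var_uo=0.25, var_vo=0.5)

# Forward check with the peephole variances (unit input variance).
var_f = 1 * (0.1 + 0.1) + var_vf * var_c
var_i = 1 * (0.05 + 0.05) + var_vi * var_c
var_z = 1 * (0.5 + 0.5)
var_o = 1 * (0.25 + 0.25) + 0.5 * var_c
var_h = var_o * var_c
```

For this example the cell state variance is stationary (`var_f * var_c + var_i * var_z` equals `var_c`) and the output variance `var_h` equals one, which is the target of the derivation.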

3.2 Nonlinear Activation Functions

All the abovementioned equations are obtained based on the assumption that the activation functions are identity functions. In general, symmetric functions with zero intercepts such as the identity and hyperbolic tangent are suggested for $\sigma_h$ and $\sigma_c$, respectively, and the logistic sigmoid is suggested for $\sigma_g$ [9]. Both the hyperbolic tangent and logistic sigmoid are nonlinear symmetric functions that can be linearly approximated using a Taylor series expansion. The former has a zero intercept and its expansion about zero leads to an identity function ($\sigma_c(x) \approx x$). The latter, however, has a nonzero intercept and its Taylor series about zero is approximated as $\sigma_g(x) \approx 0.5 + 0.25x$. Therefore, the sigmoid function approximately increases the input signal mean by 1/2 and scales its variance by 1/16. Note that the nonzero mean value of the sigmoid can induce important singular values in the Hessian matrix, resulting in saturation of the top layers and preventing gradients from flowing backward to learn useful features in the lower layers [13]. Using the suggested activation functions in the gates, the variance calculations for the traditional LSTM network are updated as follows based on the aforementioned Taylor series expansion

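$$
\begin{aligned}
\mathrm{Var}(f_t^j) &= \tfrac{N}{16}\,\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(i_t^j) &= \tfrac{N}{16}\,\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(z_t^j) &= N\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(o_t^j) &= \tfrac{N}{16}\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(c_t^j) &= \frac{\big(\mathrm{Var}(i_t^j) + \tfrac{1}{4}\big)\,\mathrm{Var}(z_t^j)}{\tfrac{3}{4} - \mathrm{Var}(f_t^j)},
\end{aligned}
$$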

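where the last equation is obtained bearing in mind that $\mathrm{Var}(xy) = \mathrm{Var}(x)\mathrm{Var}(y) + E^2(x)\mathrm{Var}(y) + E^2(y)\mathrm{Var}(x)$ for two independent random variables x and y, and considering $E(z_t^j) = 0$, $E(f_t^j) = E(i_t^j) = 0.5$, and, hence, $E(c_t^j) = E(c_{t-1}^j) = 0$. Finally, the updated rule for initialization of a traditional LSTM network using Equation (7) can be written as

$$
16\,\Big(12 - N\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big)\Big) = N^2\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\Big(N\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big) + 4\Big). \tag{13}
$$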

Applying the same suggested functions in the peephole LSTM network generalizes the variance calculations as follows

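$$
\begin{aligned}
\mathrm{Var}(f_t^j) &= \tfrac{1}{16}\Big(N\,\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_f)\,\mathrm{Var}(c_{t-1}^j)\Big), \\
\mathrm{Var}(i_t^j) &= \tfrac{1}{16}\Big(N\,\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_i)\,\mathrm{Var}(c_{t-1}^j)\Big), \\
\mathrm{Var}(z_t^j) &= N\,\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\,\mathrm{Var}(x_t^j), \\
\mathrm{Var}(o_t^j) &= \tfrac{1}{16}\Big(N\,\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)\,\mathrm{Var}(x_t^j) + \mathrm{Var}(v_o)\,\mathrm{Var}(c_t^j)\Big), \\
\mathrm{Var}(c_t^j) &= \frac{\big(\mathrm{Var}(i_t^j) + \tfrac{1}{4}\big)\,\mathrm{Var}(z_t^j)}{\tfrac{3}{4} - \mathrm{Var}(f_t^j)}.
\end{aligned}
$$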

Here also using Equation (7), two quadratic equations can be obtained similar to Equations (8) and (10), where $\beta_{01} = -16$, $\beta_{11} = N\,(\mathrm{Var}(w_o) + \mathrm{Var}(u_o))$, $\beta_{21} = \mathrm{Var}(v_o)$, $\beta_{02} = N\,(\mathrm{Var}(w_c) + \mathrm{Var}(u_c))\,(N\,(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)) + 4)$, $\beta_{22} = \mathrm{Var}(v_f)$, and $\beta_{12} = N\,\mathrm{Var}(v_i)\,(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)) + N\,(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)) - 12$. Likewise, conditions for the existence of a common solution to Equations (8) and (10) can be obtained using Equations (9) and (11) as follows

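$$
\begin{aligned}
12 - N\,\mathrm{Var}(v_i)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big) - N\big(\mathrm{Var}(w_f) + \mathrm{Var}(u_f)\big) &= \sqrt{4N\,\mathrm{Var}(v_f)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\Big(N\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big) + 4\Big)}, \\
\frac{\sqrt{4N\,\mathrm{Var}(v_f)\big(\mathrm{Var}(w_c) + \mathrm{Var}(u_c)\big)\Big(N\big(\mathrm{Var}(w_i) + \mathrm{Var}(u_i)\big) + 4\Big)}}{2\,\mathrm{Var}(v_f)} &= \frac{\sqrt{N^2\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)^2 + 64\,\mathrm{Var}(v_o)} - N\big(\mathrm{Var}(w_o) + \mathrm{Var}(u_o)\big)}{2\,\mathrm{Var}(v_o)}.
\end{aligned}
\tag{14}
$$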
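The two ingredients of this subsection, the linearization $\sigma_g(x) \approx 0.5 + 0.25x$ and the product-variance identity for independent variables, can both be checked with a few lines of plain Python. This is an illustrative sketch only: the interval, the input standard deviation of 0.1, and the two small discrete distributions are arbitrary choices made for the example.

```python
import math
import random
from fractions import Fraction
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# 1) Worst-case linearization error on [-0.5, 0.5].
grid = [k / 100.0 for k in range(-50, 51)]
max_error = max(abs(sigmoid(x) - (0.5 + 0.25 * x)) for x in grid)

# 2) For small zero-mean inputs the sigmoid shifts the mean to about 1/2
#    and scales the variance by about 1/16.
rng = random.Random(0)
xs = [rng.gauss(0.0, 0.1) for _ in range(10000)]
ys = [sigmoid(x) for x in xs]
variance_ratio = variance(ys) / variance(xs)
output_mean = mean(ys)

# 3) Exact check of Var(xy) = Var(x)Var(y) + E^2(x)Var(y) + E^2(y)Var(x)
#    for two independent discrete variables (all value pairs equally likely).
a = [Fraction(0), Fraction(1)]                    # mean 1/2, like a gate
b = [Fraction(-1), Fraction(2), Fraction(3)]
lhs = variance([p * q for p, q in product(a, b)])
rhs = (variance(a) * variance(b)
       + mean(a) ** 2 * variance(b)
       + mean(b) ** 2 * variance(a))
```

The approximation degrades as the pre-activations grow, so the 1/16 scaling should be read as a statement about the small-signal regime around initialization.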

3.3 Initialization Summary

The proposed initialization rule can be summarized as follows:

1. Standardize the input data to have a zero mean and unit variance per feature, and initialize the LSTM network biases to zero.

2. Initialize the weights in the weight matrices randomly using zero-mean i.i.d. Gaussian distributions with variances satisfying one of the following equations:

   (a) Equation (1), if using the traditional LSTM network based on identity or hyperbolic tangent functions.

   (b) Equation (12), if using the peephole LSTM network based on identity or hyperbolic tangent functions.

   (c) Equation (13), if using the traditional LSTM network based on identity or hyperbolic tangent for input modulation and cell activation, and logistic sigmoid functions in the gates.

   (d) Equation (14), if using the peephole LSTM network based on identity or hyperbolic tangent for input modulation and cell activation, and logistic sigmoid functions in the gates.

Note that the variances need to be selected subject to the specified conditions in the selected equation. For example, when using a peephole LSTM, and, correspondingly, Equation (12) or (14), there are eleven variances to fix: $\mathrm{Var}(v_f)$, $\mathrm{Var}(v_i)$, $\mathrm{Var}(v_o)$, $\mathrm{Var}(w_f)$, $\mathrm{Var}(u_f)$, $\mathrm{Var}(w_i)$, $\mathrm{Var}(u_i)$, $\mathrm{Var}(w_o)$, $\mathrm{Var}(u_o)$, $\mathrm{Var}(w_c)$, and $\mathrm{Var}(u_c)$.
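The summarized procedure can be sketched in plain Python as follows. The code only covers the mechanical steps (standardization, zero biases, zero-mean Gaussian sampling); it does not check that the supplied variances satisfy the selected equation. The function names, the dictionary layout, and the placeholder variance of 0.05 are assumptions made for this example and should be replaced by values that meet the chosen condition.

```python
import math
import random

def standardize(data):
    """Scale each feature (column) to zero mean and unit variance."""
    n, d = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(d)]
    stds = [math.sqrt(sum((row[k] - means[k]) ** 2 for row in data) / n)
            for k in range(d)]
    return [[(row[k] - means[k]) / stds[k] for k in range(d)] for row in data]

def init_lstm(N, M, variances, rng):
    """Draw weights from zero-mean Gaussians and set all biases to zero.

    variances maps names such as 'wf' or 'uo' to a variance; names starting
    with 'w' give M x N input matrices, names starting with 'u' give M x M
    recurrent matrices.
    """
    params = {}
    for name, var in variances.items():
        cols = N if name.startswith("w") else M
        std = math.sqrt(var)
        params[name] = [[rng.gauss(0.0, std) for _ in range(cols)]
                        for _ in range(M)]
    for gate in "fico":
        params["b" + gate] = [0.0] * M
    return params

rng = random.Random(0)
names = ("wf", "uf", "wi", "ui", "wc", "uc", "wo", "uo")
variances = {name: 0.05 for name in names}   # placeholder values only
params = init_lstm(N=6, M=6, variances=variances, rng=rng)
```

A peephole network would add three vectors of diagonal elements for $V_f$, $V_i$, and $V_o$, drawn in the same way from their own variances.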

4.1 Data

Both univariate and multivariate data are used to study the effect of initialization on LSTM training.

The following three univariate datasets are obtained from the UCR Time Series Archive [10] because they have the largest training sets: ElectricDevices with 16,637 samples (8,926 for training and 7,711 for test) of sequence length 96; FordA with 4,921 samples (3,601 for training and 1,320 for test) of sequence length 500; and Crop with 24,000 samples (7,200 for training and 16,800 for test) of sequence length 46.

The multivariate dataset, ADNI, focuses on disease progression modeling and is obtained from the ADNI cohort [12]. It comprises yearly measurements for 383 subjects (332 for training and 51 for test) of sequence length 3 to 10 with normal cognition, mild cognitive impairment, or Alzheimer’s disease. The multivariate feature set consists of T1-weighted MRI volumetric measurements of ventricles, hippocampus, whole brain, fusiform, middle temporal gyrus, and entorhinal cortex, all normalized for intracranial volume.

4.2 Experimental Setup

The proposed initialization method is assessed using a peephole LSTM [9] applied to the univariate data (N = 1) for time series regression and a peephole LSTM robust to missing values [11] applied to the multivariate data (N = 6) for disease progression modeling. In both cases, an identity function and a hyperbolic tangent are used in $\sigma_h$ and $\sigma_c$, respectively, a logistic sigmoid is used in $\sigma_g$, and the network biases are initialized to zero. Therefore, the variance selection for weight matrices is performed using Equation (14), and weight values are drawn from zero-mean i.i.d. Gaussian distributions. Four different configurations of the variances are inspected as illustrated in Table 1.

The input data is standardized to have a zero mean and unit variance per feature dimension. Moreover, the batch size is set to 85% of training samples (15% used for validation to tune the optimization hyperparameters), and the first to penultimate time points are used to estimate the second to last time points per observation. The L2-norm is used as the loss function, and momentum batch gradient descent is applied to optimize the network parameters using L2 regularization. The optimization hyperparameters, i.e., the learning rate, momentum weight, and weight decay are set to 0.1, 0.9, and 0.0001, respectively. These values were selected according to the validation set error across the different experiments.

Table 1: The utilized configurations of the variances satisfying Equation (14).

image

image

Fig. 1: The training loss of the different methods applied to the univariate and multivariate datasets.
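The training setup can be illustrated with a short sketch. It assumes the classical momentum update with the L2 penalty added to the gradient, which is one common reading of "momentum batch gradient descent with L2 regularization"; the exact update used in the experiments is not specified in the text. The function names and the toy quadratic objective are hypothetical.

```python
def one_step_ahead_pairs(sequence):
    """First to penultimate points as inputs, second to last as targets."""
    return sequence[:-1], sequence[1:]

def momentum_gd(grad, w0, lr=0.1, momentum=0.9, weight_decay=1e-4, steps=500):
    """Minimize a scalar objective with momentum and L2 weight decay."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad(w) + weight_decay * w    # L2 regularization term
        v = momentum * v - lr * g         # velocity update
        w = w + v                         # parameter update
    return w

inputs, targets = one_step_ahead_pairs([0.1, 0.4, 0.2, 0.9])

# Toy objective (w - 3)^2 with gradient 2 (w - 3); the weight decay pulls
# the minimizer slightly below 3, to 6 / (2 + weight_decay).
w_star = momentum_gd(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

The default arguments mirror the reported hyperparameters (learning rate 0.1, momentum 0.9, weight decay 0.0001).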

The proposed approach is compared with two state-of-the-art initialization techniques applied to the same LSTM networks assuming zero biases and using the same optimization setup: normalized [13], all weight matrices drawn i.i.d. from zero-mean Gaussian distributions with a scaled variance of 1/N; orthogonal [5], same as normalized, but with orthogonal recurrent weight matrices drawn i.i.d. from zero-mean Gaussian distributions with a variance of 1/N.
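The two baselines can be sketched as follows. The normalized baseline draws every entry from a zero-mean Gaussian with variance 1/N, where N is taken here as the number of columns. For the orthogonal baseline, the sketch orthonormalizes the rows of such a Gaussian matrix with modified Gram-Schmidt; this is one standard construction and is an assumption, since the text does not state how the orthogonal matrices were generated. The function names are illustrative.

```python
import math
import random

def normalized_init(rows, cols, rng):
    """Zero-mean Gaussian entries with variance 1 / cols."""
    std = math.sqrt(1.0 / cols)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def orthogonal_init(size, rng):
    """Orthonormalize the rows of a Gaussian matrix (modified Gram-Schmidt)."""
    A = normalized_init(size, size, rng)
    Q = []
    for row in A:
        v = list(row)
        for q in Q:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]   # remove component along q
        norm = math.sqrt(sum(a * a for a in v))
        Q.append([a / norm for a in v])
    return Q

rng = random.Random(0)
W_input = normalized_init(4, 6, rng)     # M x N input weights
U_recurrent = orthogonal_init(4, rng)    # M x M recurrent weights
```

An orthogonal recurrent matrix preserves the norm of the hidden state under repeated multiplication, which is the property these baselines rely on.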

4.3 Results

Figure 1 compares the training loss of the proposed and state-of-the-art initialization methods applied to the univariate and multivariate datasets. As can be seen, the proposed method with any configuration outperforms the prevalent initialization techniques in all experiments, either by achieving a lower loss (ElectricDevices and FordA) or by faster convergence to the same loss (Crop and ADNI).

Table 2: The generalization error (MSE) in predicting the feature values for the utilized test sets using the different initialization techniques.

image

To further investigate the influence of initialization on the performance, we also evaluate the generalization error in the test set. Table 2 reports the test mean square error (MSE) in predicting the feature values per dataset for the utilized initialization methods. As can be deduced, the proposed initialization method with any configuration achieves superior results to the prevalent initialization approaches, which illustrates the generalizability of the proposed method.

More interestingly, the fourth configuration of the proposed method in which the recurrent weights receive more variance than the current input weights outperforms all the other methods in almost all of the experiments.

In this paper, a robust initialization method was proposed for LSTM networks to address training instability and slow convergence. The proposed method was based on scaled random weights initialization aiming to keep the variance of the network input and output signals in the same range, subject to a number of assumptions simplifying the initialization conditions. The proposed method was applied to univariate and multivariate time series regression datasets and outperformed two state-of-the-art initialization methods in all cases.

The obtained conditions can be optimized for eight or eleven unknowns using a traditional LSTM or peephole LSTM, respectively. In this work, different configurations of the variances were inspected to confirm the proposed assumption for initializing the network weights. Moreover, the proposed method can be used for sequence-to-sequence and sequence-to-label learning paradigms by connecting a fully-connected layer with a desired output size to the LSTM network output. It should also be noted that the initialization conditions need to be properly modified in case of using activation functions other than a hyperbolic tangent, identity function, or logistic sigmoid in the gates.

Acknowledgments. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 721820.

1. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)

2. Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the International Conference on Machine Learning. (2011) 1033–1040

3. Trinh, T.H., Dai, A.M., Luong, M.T., Le, Q.V.: Learning longer-term dependencies in RNNs with auxiliary losses. CoRR abs/1803.00144 (2018)

4. Le, Q.V., Jaitly, N., Hinton, G.E.: A simple way to initialize recurrent networks of rectified linear units. CoRR abs/1504.00941 (2015)

5. Vorontsov, E., Trabelsi, C., Kadoury, S., Pal, C.: On orthogonality and learning recurrent networks with long term dependencies. CoRR abs/1702.00071 (2017)

6. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997) 1735–1780

7. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. (2014) 1724–1734

8. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. (2014) 3104–3112

9. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research 3 (2002) 115–143

10. Dau, H.A., Bagnall, A., Kamgar, K., Yeh, C.C.M., Zhu, Y., Gharghabi, S., Ratanamahatana, C.A., Keogh, E.: The UCR Time Series Archive. CoRR abs/1810.07758 (2018)

11. Ghazi, M.M., Nielsen, M., Pai, A., Cardoso, M.J., Modat, M., Ourselin, S., Sørensen, L.: Training recurrent neural networks robust to incomplete data: Application to Alzheimer’s disease progression modeling. Medical Image Analysis 53 (2019) 39–46

12. Petersen, R.C., Aisen, P.S., Beckett, L.A., Donohue, M.C., Gamst, A.C., Harvey, D.J., Jack, C.R., Jagust, W.J., Shaw, L.M., Toga, A.W., Trojanowski, J.Q., Weiner, M.W.: Alzheimer’s Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74 (2010) 201–209

13. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. (2010) 249–256

14. Erhan, D., Manzagol, P.A., Bengio, Y., Bengio, S., Vincent, P.: The difficulty of training deep architectures and the effect of unsupervised pre-training. In: Proceedings of the International Conference on Artificial Intelligence and Statistics. (2009) 153–160

15. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision. (2015) 1026–1034

16. Talathi, S.S., Vartak, A.: Improving performance of recurrent neural network with ReLU nonlinearity. CoRR abs/1511.03771 (2015)

17. Buraczewski, D., Damek, E., Mikosch, T., et al.: Stochastic models with power-law tails. Springer (2016)

