Trainable back-propagated functional transfer matrices

2017·Arxiv

Abstract

Connections between nodes of fully connected neural networks are usually represented by weight matrices. In this article, functional transfer matrices are introduced as alternatives to the weight matrices: Instead of using real weights, a functional transfer matrix uses real functions with trainable parameters to represent connections between nodes. Multiple functional transfer matrices are then stacked together with bias vectors and activations to form deep functional transfer neural networks. These neural networks can be trained within the framework of back-propagation, based on a revision of the delta rules and the error transmission rule for functional connections. In experiments, it is demonstrated that the revised rules can be used to train a range of functional connections: 20 different functions are applied to neural networks with up to 10 hidden layers, and most of them gain high test accuracies on the MNIST database. It is also demonstrated that a functional transfer matrix with a memory function can roughly memorise a non-cyclical sequence of 400 digits.

Keywords: Functional Transfer Matrices, Deep learning, Trainable functional connections

2010 MSC: 00-01, 99-00

1. Introduction

For instance, a fully connected deep neural network usually has a weight matrix in each hidden layer, and the weight matrix, a bias vector and a nonlinear

has been done on combining neural networks with different activations, such as the logistic function [2], the rectifier [3], the maxout function [4] and the long short-term memory blocks [5]. In this work, we study neural networks from another viewpoint: Instead of using different activations, we replace weights in

between nodes are represented by real functions rather than real numbers. Specifically, this work focuses on:

the theory of functional transfer neural networks, including functional transfer matrices and a back-propagation algorithm for functional connections. Section 4 provides some examples for explaining the meanings of functionally connected structures. Section 5 discusses training methods for practical use. Section 6

concludes this work and discusses future work.

2. Background on deep feedforward neural networks

networks and only mention concepts and formulae relevant to our research [6]: A feedforward neural network usually consists of an input layer, some hidden layers and an output layer. Two neighbouring layers are connected by a linear weight matrix $W$, a bias vector $b$ and an activation $\xi(\cdot)$. Given an input vector $x$, the neural network can map it to another vector $y$ via:

$$y = \xi(W \cdot x + b) \qquad (1)$$

3. A theory of functional transfer neural networks

3.1. Functional transfer matrices

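to represent connections between nodes. Formally, it is defined as:

$$F = \begin{pmatrix} F_{1,1}(x) & F_{1,2}(x) & \cdots & F_{1,N}(x) \\ F_{2,1}(x) & F_{2,2}(x) & \cdots & F_{2,N}(x) \\ \vdots & \vdots & \ddots & \vdots \\ F_{M,1}(x) & F_{M,2}(x) & \cdots & F_{M,N}(x) \end{pmatrix} \qquad (2)$$

where each $F_{i,j}(x)$ is a trainable function. We also define a transfer computation "$\odot$" for the functional transfer matrix and a column vector $v = (v_1, v_2, \dots, v_N)^T$ such that:

$$F \odot v = \left( \sum_{j=1}^{N} F_{1,j}(v_j),\ \sum_{j=1}^{N} F_{2,j}(v_j),\ \dots,\ \sum_{j=1}^{N} F_{M,j}(v_j) \right)^T \qquad (3)$$

Suppose that $b = (b_1, b_2, \dots, b_M)^T$ is a bias vector, $\xi(\cdot)$ is an activation, and $x = (x_1, x_2, \dots, x_N)^T$ is an input vector. An output vector $y = (y_1, y_2, \dots, y_M)^T$ can then be computed via

$$y = \xi(F \odot x + b)$$

In other words, the $i$th element of the output vector is

$$y_i = \xi\left( \sum_{j=1}^{N} F_{i,j}(x_j) + b_i \right)$$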
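The transfer computation can be illustrated with a minimal Python sketch. The names `transfer` and `forward`, the list-of-lists representation of the matrix and the example functions are assumptions made for illustration; they are not taken from the original work.

```python
import math

def transfer(F, v):
    """Transfer computation: the i-th output is sum_j F[i][j](v[j])."""
    return [sum(f(vj) for f, vj in zip(row, v)) for row in F]

def forward(F, b, act, x):
    """One functional transfer layer: y_i = act(sum_j F[i][j](x[j]) + b[i])."""
    return [act(u + bi) for u, bi in zip(transfer(F, x), b)]

# Hypothetical 2 x 3 functional transfer matrix: every entry is a real function.
F = [
    [lambda x: 2.0 * x, math.sin, lambda x: x ** 2],
    [lambda x: -x, lambda x: 0.5 * x, math.cos],
]
x = [1.0, 0.0, 2.0]
b = [0.0, 0.0]

u = transfer(F, x)               # [2*1 + sin(0) + 2**2, -1 + 0.5*0 + cos(2)]
y = forward(F, b, math.tanh, x)  # element-wise activation of u + b
```

If every entry has the form $F_{i,j}(x) = w_{i,j} x$, the transfer computation reduces to an ordinary matrix-vector product, which is why a standard weight matrix can be seen as a special case.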

3.2. Back-propagation

vectors and activations to form a multi-layer model, and the model can be trained via back-propagation [6]: Suppose that a model has $L$ hidden layers. For the $l$th hidden layer, $F^{(l)}$ is a functional transfer matrix, its function $F^{(l)}_{i,j}(x)$ is a real function with an independent variable $x$ and $r$ trainable parameters $p^{(l)}_{i,j,k}$ ($k = 1, 2, \dots, r$), $b^{(l)}$ is a bias vector, $\xi(\cdot)$ is an activation, and $x^{(l)}$ is an input signal. An output signal $y^{(l)}$ is computed via:

$$y^{(l)} = \xi\big(F^{(l)} \odot x^{(l)} + b^{(l)}\big)$$

The output signal is the input signal of the next layer. In other words, if $l < L$, then $x^{(l+1)} = y^{(l)}$. After an error signal is evaluated by an output layer and an error function (see Section 5.2), the parameters can be updated: Suppose that $\delta^{(l)}$ is an error signal of output nodes, and $\epsilon$ is a learning rate. Let

Deltas of the biases are the same as the conventional rule:

An error signal of the (1)th layer is computed via:

Please note that the computation of () can be simplified as a transfer

computation: Let be a derivative matrix in which each element is

Then the computation can be done via . Similarly, the computation of

the transfer computation.
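For orientation, the revised rules can be sketched as a worked derivation. The notation below is an assumption, not the authors' own: $u_i^{(l)}$ is the pre-activation of node $i$ in layer $l$, $p_{i,j,k}^{(l)}$ is the $k$th parameter of $F_{i,j}^{(l)}$, $\xi$ is the activation, $\epsilon$ is the learning rate, and the error signal $\delta^{(l)}$ is taken to point in the direction that reduces the error, so that deltas are added to the parameters. The sign convention in the original equations may differ.

```latex
% Sketch under assumed notation; not a verbatim copy of the paper's equations.
\begin{align}
u_i^{(l)} &= \sum_{j} F_{i,j}^{(l)}\big(x_j^{(l)}\big) + b_i^{(l)},
  \qquad y_i^{(l)} = \xi\big(u_i^{(l)}\big) \\
\Delta p_{i,j,k}^{(l)} &= \epsilon\, \delta_i^{(l)}\, \xi'\big(u_i^{(l)}\big)\,
  \frac{\partial F_{i,j}^{(l)}}{\partial p_{i,j,k}^{(l)}}\big(x_j^{(l)}\big) \\
\Delta b_i^{(l)} &= \epsilon\, \delta_i^{(l)}\, \xi'\big(u_i^{(l)}\big) \\
\delta_j^{(l-1)} &= \sum_{i} \delta_i^{(l)}\, \xi'\big(u_i^{(l)}\big)\,
  \frac{\partial F_{i,j}^{(l)}}{\partial x}\big(x_j^{(l)}\big)
\end{align}
```

As a consistency check, substituting $F_{i,j}(x) = w_{i,j} x$ gives $\partial F_{i,j} / \partial w_{i,j} = x_j$ and $\partial F_{i,j} / \partial x = w_{i,j}$, which recovers the conventional delta rule and error transmission rule for a weight matrix.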

4. Examples

4.1. Periodic functions

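instance, $f(x) = A \sin(\omega x + \phi)$ is a periodic function, where $A$, $\omega$ and $\phi$ are its amplitude, angular velocity and initial phase respectively. Thus, we can define a matrix consisting of the following function:

$$F_{i,j}(x) = A_{i,j} \sin(\omega_{i,j} x + \phi_{i,j})$$

where $A_{i,j}$, $\omega_{i,j}$ and $\phi_{i,j}$ are trainable parameters. If it is an $M \times N$ matrix, then it can transfer an $N$-dimensional vector $(x_1, x_2, \dots, x_N)^T$ to an $M$-dimensional vector $(y_1, y_2, \dots, y_M)^T$ such that

$$y_i = \sum_{j=1}^{N} A_{i,j} \sin(\omega_{i,j} x_j + \phi_{i,j})$$

It is noticeable that each $y_i$ is a composition of many periodic functions, and their amplitudes, angular velocities and initial phases can be updated via training. The training process is based on the following partial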
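The partial derivatives needed for training such a periodic connection can be sketched as follows. The sinusoidal form $A \sin(\omega x + \phi)$ and the names `periodic`, `periodic_grads` and `numeric_grad` are assumptions for illustration; the finite-difference check is only there to confirm that the analytic derivatives are consistent.

```python
import math

def periodic(x, A, w, p):
    """F(x) = A * sin(w * x + p): amplitude A, angular velocity w, initial phase p."""
    return A * math.sin(w * x + p)

def periodic_grads(x, A, w, p):
    """Analytic partial derivatives of F with respect to A, w, p and the input x."""
    s, c = math.sin(w * x + p), math.cos(w * x + p)
    return {"A": s, "w": A * x * c, "p": A * c, "x": A * w * c}

def numeric_grad(f, args, index, h=1e-6):
    """Central finite difference with respect to args[index]."""
    upper = list(args)
    lower = list(args)
    upper[index] += h
    lower[index] -= h
    return (f(*upper) - f(*lower)) / (2 * h)

args = (0.7, 1.5, 2.0, 0.3)  # x, A, w, p (arbitrary illustrative values)
g = periodic_grads(*args)
checks = {name: numeric_grad(periodic, args, i)
          for i, name in enumerate(["x", "A", "w", "p"])}
```

The derivative with respect to `x` is the one used to transmit the error signal to the previous layer, while the other three drive the parameter updates.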

4.2. Modelling of conic hypersurfaces

constructed by ellipses and hyperbolas. Recall the mathematical definition of ellipses: A ellipse with a centre () and semi-axes and can

positive number. Let = (

. The equation becomes = 0. It is noticeable that can be rewritten to

· (· (⊙

(See Eq. 3). Therefore, a model with an activation ) can be defined as

y = ξ((x (x

and y is an output. Based on the above structure, multiple ellipse decision

In the above model, there are two input nodes, three hidden nodes and an output node $z$, and $\xi(\cdot)$ is an activation. The input nodes and the hidden nodes are connected via a functional transfer matrix. The hidden nodes and the output node are connected via a weight matrix. The hidden nodes and the output node are activated via biases and the activation. Fig. 1 shows decision boundaries formed by this model: The functional transfer matrix and the hidden nodes form three ellipse boundaries (in orange, green and purple respectively). The weight matrix and the output node form a boundary which is the union of the inner parts of all ellipse boundaries. In addition, this figure shows some example inputs and outputs: If the inputs are inside the decision boundaries, the output is 1. Otherwise, the output is 0. The reason why the model can construct ellipse boundaries is that its functional transfer matrix consists of the following function:

where and are trainable parameters. If the input dimension of this matrix is 2, it will construct ellipse boundaries on a plane. If its input dimension is 3, it will construct ellipsoid boundaries in a 3-dimensional space. Generally, if

Figure 1: Decision boundaries formed by Eq. (15), Eq. (16) and Eq. (17).

its input dimension is $N$ ($N \geq 2$), it will construct boundaries represented by $(N-1)$-dimensional closed hypersurfaces of ellipses in an $N$-dimensional space. This phenomenon reflects a significant difference between a transfer matrix with Eq. (18) and a standard linear weight matrix, as the former models closed hypersurfaces, while the latter models hyperplanes. In addition, to use the back-propagation algorithm to train the parameters, derivatives are computed via:

also hyperbolas. The adapted function is defined as:

where the sign coefficient is initialised as 1 or -1. Please note that this coefficient is a constant, NOT a trainable parameter. Given a $1 \times 2$ matrix with this function:

If it is initialised by:

then it can represent an ellipse boundary. On the other hand, if it is initialised by:

then it can be used to form a hyperbola boundary. More generally, in an $N$-dimensional space, an $M \times N$ functional transfer matrix with Eq. (22) can represent $M$ different $(N-1)$-dimensional conic hypersurfaces.
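The idea of building conic boundaries from per-coordinate functions can be sketched in Python. The exact forms of Eq. (18) and Eq. (22) are not reproduced here: the quadratic term `sign * ((x - centre) / semi_axis) ** 2`, the bias of -1 and all names below are assumptions chosen to match the standard equations of an ellipse and a hyperbola.

```python
def conic_term(x, centre, semi_axis, sign=1.0):
    """Assumed form of one functional connection: sign * ((x - centre) / semi_axis) ** 2."""
    return sign * ((x - centre) / semi_axis) ** 2

def conic_value(point, centres, semi_axes, signs=None, bias=-1.0):
    """Sum of per-coordinate terms plus a bias; the boundary is where this equals 0."""
    signs = signs or [1.0] * len(point)
    return sum(conic_term(x, c, a, s)
               for x, c, a, s in zip(point, centres, semi_axes, signs)) + bias

# Ellipse centred at (1, 2) with semi-axes 3 and 1 (all signs +1).
centres, semi_axes = [1.0, 2.0], [3.0, 1.0]
inside = conic_value([1.0, 2.0], centres, semi_axes)   # at the centre: 0 + 0 - 1
on_edge = conic_value([4.0, 2.0], centres, semi_axes)  # (3/3)**2 + 0 - 1
outside = conic_value([1.0, 4.0], centres, semi_axes)  # 0 + (2/1)**2 - 1

# Flipping one sign to -1 gives a hyperbola: x**2 - y**2 - 1 = 0 holds at (1, 0).
hyper = conic_value([1.0, 0.0], [0.0, 0.0], [1.0, 1.0], signs=[1.0, -1.0])
```

Negative values lie inside the ellipse and positive values outside, so an activation applied to this quantity can separate the two regions; with fixed signs of +1 or -1 per coordinate the same structure covers both closed and open conic boundaries.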

4.3. Sleeping weights and dead weights

“sleep”. The reason is that the connection from Node j to Node i is a real weight $w_{i,j}$. Let $x_j$ denote a state of Node j. Node i will receive a signal $w_{i,j} x_j$. When $w_{i,j} \neq 0$ and $x_j \neq 0$, the signal always influences Node i, regardless of whether or not Node i needs it. To enable the connections to “sleep” temporarily, the following function is used in a functional transfer matrix:

In the above function, and are trainable parameters, and is a constant which is initialised as 1 or -1 before training. It also makes use of the rectifier [3]:

The rectifier enables the function to “sleep”: whenever the input to the rectifier is not positive, the function outputs zero. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

word “die” is different from the word “sleep”, as the former means that $F_{i,j}(x) = 0$ for all $x$, and the latter means that $F_{i,j}(x) = 0$ for some $x$. If the connection from Node j to Node i is dead, then any change of Node j does not influence Node i. This function is:

In the above function, is a trainable parameter, and is a constant which is initialised as 1 or -1 before training. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

It is noticeable that ),

means that the function does not transfer any signal and cannot be updated in

this case. In other words, the function is dead once is updated to a negative value. A matrix with this function can then represent a partially connected neural network, as dead functions can be considered as broken connections after training. A problem about the function is that it is not able to “revive”. In other words, once is updated to a negative value, it has no chance to be non-negative again. To solve this problem, an adapted version of the rectifier is used:

For the computation of

We use “” instead of “=” because this derivative is not a mathematically sound result, but a predefined value. Thus, derivatives of ) are computed

It is noticeable that

can be updated when it is negative. Thus, although the function is dead

when is negative, it can be revived by updating to a non-negative value.
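The difference between a dead weight that stays dead and one that can be revived can be sketched as follows. The connection form `c * relu(a) * x`, the surrogate derivative value of 1 and all function names are assumptions made for illustration; the text above only states that the derivative of the adapted rectifier is a predefined value rather than a mathematically derived one.

```python
def relu(a):
    """Standard rectifier."""
    return a if a > 0 else 0.0

def relu_grad_standard(a):
    """Usual derivative: 0 for negative inputs, so a dead weight receives no update."""
    return 1.0 if a > 0 else 0.0

def relu_grad_revivable(a):
    """Predefined surrogate derivative (an assumption here): 1 for every input."""
    return 1.0

def dead_weight(x, a, c=1.0):
    """Assumed connection F(x) = c * relu(a) * x with trainable a and constant c."""
    return c * relu(a) * x

def update_a(a, x, err, lr, relu_grad, c=1.0):
    """One update of a, where err is the error signal arriving at the connection."""
    return a + lr * err * c * x * relu_grad(a)

a_dead = -0.2
stuck = update_a(a_dead, x=1.0, err=1.0, lr=0.5, relu_grad=relu_grad_standard)
revived = update_a(a_dead, x=1.0, err=1.0, lr=0.5, relu_grad=relu_grad_revivable)
```

With the standard derivative the parameter stays at its negative value and the connection keeps transferring nothing; with the predefined derivative the same error signal moves the parameter back above zero and the connection transfers signals again.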

4.4. Sequential Modelling via Memory Functions

sequences. For instance, given a sequential input , the tth

state (t = 1) of a memory function ) is computed via:

where and are trainable parameters. In particular, is a memory cell and its initial state is set to 0. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

signals from a previous state, as each memory function records one signal. On the other hand, a standard recurrent neural network usually uses hidden units to record signals from the previous state. If it has N input units and M hidden units, then it can record M signals. It is noticeable that the functional transfer

recurrent neural network.

5. Practical training methods for functional transfer neural networks

neural networks. These methods are also used in the experiments (Section 6).

Figure 2: The general structure of functional transfer neural networks.

5.1. The model structure and initialisation

They usually have an input layer, one or more hidden layers and an output layer. Each hidden layer consists of a functional transfer matrix and some hidden units with a bias vector and an activation function. In particular, the activation function can be the logistic sigmoid function [3]:

$$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}},$$

the rectified linear unit (ReLU)

$$\mathrm{relu}(x) = \max(0, x),$$

or the hyperbolic tangent (tanh) function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}.$$

The output layer consists of a linear weight matrix and $N$ output units with a bias vector and a softmax function:

$$\mathrm{softmax}(u)_i = \frac{e^{u_i}}{\sum_{k=1}^{N} e^{u_k}}$$
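The activations and the softmax output named above follow their standard definitions, which can be sketched in plain Python. The function names are assumptions, and subtracting the maximum inside `softmax` is a common numerical-stability step rather than something stated in the text.

```python
import math

def logistic(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)

def softmax(u):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(u)
    exps = [math.exp(ui - m) for ui in u]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])  # three positive values that sum to 1
```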

initialised based on the following method:

5.2. Training a single hidden layer

layer, and an output signal is generated. Then an error signal is computed by comparing the output signal with a target signal. The comparison is based on the cross-entropy criterion [9]. Next, the error signal is back-propagated through the output layer and the hidden layer, and deltas of parameters are computed.

the rules described by Section 3.2 are used. Finally, the parameters are updated according to their deltas. The above process can be combined with stochastic gradient descent [10].

5.3. Layer-wise supervised training and fine-tuning

Figure 3: Layer-wise supervised training and fine-tuning. Functional connections are denoted by solid lines, and linear weight connections are denoted by dashed lines.

number of hidden layers increases. To resolve this problem, the combination of layer-wise supervised training and fine-tuning is used [11], which is shown by Fig. 3. Layer-wise training includes the following steps: Firstly, the first hidden layer is trained, while the other hidden layers and the output layer are

and a bias vector) is added onto it, and the method for training a single hidden layer (described by Section 5.2) is used to train it. Then the softmax layer is removed, and the second hidden layer is added onto the first hidden layer. To train the second hidden layer, another new softmax layer is added onto it, and

is not updated in this step. Finally, all remaining hidden layers are trained by using the same method. In particular, the trained softmax layer on the last hidden layer is considered as the output layer of the whole neural network. After layer-wise training, the whole neural network can be further optimised via

For the hidden layers, the rules described by Section 3.2 are used. In practice, learning rates for fine-tuning are smaller than those for layer-wise training.
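The order of operations in layer-wise supervised training followed by fine-tuning can be sketched as a simple schedule. The function `layerwise_schedule` and the dictionary layout are hypothetical; the sketch only lists which parts are trained or kept fixed at each stage and does not perform any training.

```python
def layerwise_schedule(num_hidden_layers):
    """List the training stages: one stage per hidden layer, then fine-tuning."""
    stages = []
    for k in range(1, num_hidden_layers + 1):
        stages.append({
            "stage": f"layer-wise {k}",
            "frozen": list(range(1, k)),              # hidden layers trained earlier
            "trained": [k, f"softmax on layer {k}"],  # new layer plus a fresh softmax layer
        })
    stages.append({
        "stage": "fine-tuning",
        "frozen": [],
        "trained": list(range(1, num_hidden_layers + 1))
                   + [f"softmax on layer {num_hidden_layers}"],
    })
    return stages

plan = layerwise_schedule(3)
```

For three hidden layers this yields three layer-wise stages, each with its own temporary softmax layer, and a final stage in which all hidden layers and the last softmax layer are updated together.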

6. Experiments

6.1. MNIST handwritten digit recognition

6.1.1. Experimental settings

transfer neural networks. ) in the functional matrix (See Eq. (2)) is substituted for these example functions. In particular, and are used to denote trainable parameters instead of using . These functions

Section 4.4: F06, F07, and F08 have been discussed in Section 4.3; F12 has been discussed in Section 4.1; F19 and F20 have been discussed in Section 4.2. They also include functions which have not been discussed before, such as F01 - F05, F09 - F11 and F13 - F18. We test these examples to explore the

be useful, but it is difficult to cover all possible functions in this paper. On the other hand, some kinds of functions may not work. For instance, we have found that a connection based on the hyperbolic cosine does not work, because the range of $f(x) = \cosh(x)$ is $[1, +\infty)$, which means that this function does not have a

range to $[0, +\infty)$, as indicated by F15. For another instance, $f(x) = \tan(x)$ does not work, because it is a discontinuous function. This is the reason why F12 cannot be changed to a tangent-based counterpart. For the same reason, some other discontinuous functions, including $f(x) = \cot(x)$, $f(x) = \sec(x)$,

used in functional networks.

Section 5.1. In particular, the input layer has 784 nodes, and the output layer has 10 nodes. They correspond to input pixels and output labels of the MNIST

layer has 128 hidden units with the logistic activation, the ReLU or the tanh activation. Parameters of functional transfer matrices in these hidden layers

Table 1: Example Functions Used to Model Functional Networks.

Table 2: Initialisation of Transfer Matrices.

are initialised based on Table 2. All training processes make use of the layer-wise supervised training and fine-tuning strategy discussed in Section 5.3. In

and the learning rate is set to a constant 2, where . For fine-tuning, the Newbob+/Train strategy is used [12]: The learning rate is set to 2initially, and it will be halved if the improvement of the cross-entropy loss is smaller than 10. The training stops when the improvement is smaller

fine-tuning, the size of mini-batches is set to 16.
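The halving-and-stopping behaviour described for fine-tuning can be sketched as a small scheduler. This is an illustration in the spirit of the description above, not the Newbob+/Train implementation of [12]; the function name, the two thresholds and the example losses are hypothetical values.

```python
def newbob_like_schedule(losses, initial_lr, halve_below, stop_below):
    """Replay a list of per-epoch losses and return the learning rate used in each epoch.

    halve_below and stop_below are hypothetical thresholds on the loss improvement.
    """
    lr = initial_lr
    rates = []
    previous = None
    for loss in losses:
        rates.append(lr)
        if previous is not None:
            improvement = previous - loss
            if improvement < stop_below:
                break          # improvement too small: stop training
            if improvement < halve_below:
                lr /= 2.0      # small improvement: halve the rate
        previous = loss
    return rates

rates = newbob_like_schedule(
    losses=[1.00, 0.50, 0.45, 0.449, 0.30],
    initial_lr=0.5, halve_below=0.1, stop_below=0.01,
)
```

In this example the rate is halved after the third epoch, and training stops after the fourth epoch because the improvement falls below the stopping threshold, so the fifth loss value is never used.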

6.1.2. Results

include accuracies after layer-wise training (LWT), accuracies after fine-tuning

In addition, F01 - F20 are IDs of functional transfer matrices, and Logistic,

Table 3: Accuracies (%) and the Best — 1 Hidden Layer.

ReLU and Tanh are activations of hidden units. It is noticeable that most models are trained successfully, except the F11-ReLU model. Although this model fails to be trained, its counterparts, including the F11-Logistic model

best is different for different functions and activations, which means that the initial learning rate should be carefully selected for different models.

8 models failed to be trained, which means that training is becoming more

are 52 models trained successfully, which demonstrates that the revised back-propagation rules for functional transfer matrices are able to work for deep models. Among these models, 14 models obtain a better accuracy (after fine-tuning) than their counterparts in Table 3, whereas 38 models obtain a worse

better than fewer hidden layers, and vice versa. In addition, for a certain function, the best for ReLU is usually smaller than that for Logistic, and the best for Tanh is always smaller than that for Logistic. A similar phenomenon about has also appeared in Table 3.

are 11 models that failed to be trained and 49 models that were trained successfully, which means that the functional transfer matrices can still work when there are up to 10 hidden layers, but training has become more difficult. Among the models trained successfully, 12 models obtain a better accuracy (after fine-tuning) than

obtains the same accuracy. These results show again that increasing the number of hidden layers can (but does not always) bring about a worse accuracy. In addition, a comparison among the values in Table 3, Table 4 and Table 5 reveals that the best values are influenced by the functional transfer matrices, the activations

Table 4: Accuracies (%) and the Best — 5 Hidden Layers.

Table 5: Accuracies (%) and the Best — 10 Hidden Layers.

Therefore, tuning of is still required in practice.

6.2. Memorising the digits of π

6.2.1. Experimental settings

evaluated by using the same methodology as the previous experiments. In this part, we explore if functional transfer neural networks with the memory function are able to model the sequence of the circumference ratio $\pi = 3.141592653589793238\ldots$ Specifically, given the preceding digits, the neural

input layer, a hidden layer and an output layer: The input layer has 10 nodes which correspond to digits 0–9. The hidden layer has 128, 256, 512 or 1024 nodes, and it consists of a functional transfer matrix with the memory function, a bias vector and a logistic activation. The output layer has 10

matrix, a bias vector and a softmax activation. All trainable parameters in the two matrices are initialised as a real number in [1]. All biases are initialised as zero. The training method described by Section 5.2 is used to train the models. Each training epoch makes use of a sequence of training pairs

hot encodings of the $i$th and $(i+1)$th digits of $\pi$ respectively: the former is an input, and the latter is a target of output. D is set to 200, 400 or 800. The number of training epochs is set to 5000. The learning rate is set to 2, and it is not changed during the whole training process. Model performance is evaluated by
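The construction of the training pairs can be sketched as follows. Only the digits quoted in the text are used, and it is an assumption that the sequence starts at the leading 3; the names `one_hot` and `training_pairs` are hypothetical.

```python
PI_DIGITS = "3141592653589793238"  # digits quoted in the text, decimal point removed

def one_hot(digit, size=10):
    """Return a one-hot list with a 1 at the index of the digit."""
    vec = [0] * size
    vec[digit] = 1
    return vec

def training_pairs(digits, num_pairs):
    """Pair the one-hot code of the i-th digit with that of the (i + 1)-th digit."""
    ds = [int(ch) for ch in digits]
    if num_pairs > len(ds) - 1:
        raise ValueError("not enough digits for the requested number of pairs")
    return [(one_hot(ds[i]), one_hot(ds[i + 1])) for i in range(num_pairs)]

pairs = training_pairs(PI_DIGITS, 5)  # (3 -> 1), (1 -> 4), (4 -> 1), (1 -> 5), (5 -> 9)
```

Note that the same input digit can map to different targets (1 is followed by 4 and later by 5), which is why the model needs memory of earlier inputs rather than a plain digit-to-digit mapping.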

Figure 4: Learning curves of functional transfer neural networks with the memory function.

6.2.2. Results

model is denoted by “xH-yD”, where x is the number of hidden units, and y the number of training pairs. It is noticeable that all 200D models gain high

and 1024H-400D models gain high accuracies after the first 2500 training epochs, which means that matrices with the memory function are able to model the sequence of digits. However, the 128H-400D model does not gain an accuracy as high as its counterparts, which means that its memory is restricted by the

of hidden units should increase, because it determines the size of the functional transfer matrix and determines consequently the number of memory functions. For the 800D models, it is clear that the increase of the number of hidden units brings about better accuracies, though the accuracies are not significantly high.

7. Related work

deep neural networks and neurons are reviewed, and comparisons between them and our work are carried out.

7.1. Functional-link neural networks

patterns [13]. Usually, they consist of a functional expansion module, an input layer and an output layer, and the functional expansion module consists of many functional-links: For instance, a functional-link can be a trainable linear function (which is called a random variable functional-link), a multiplication

Chebyshev polynomial basis function [14]. A typical application of functional-link neural networks is to model nonlinear decision boundaries in channel equalisers [15]. Dehuri and Cho [16] also use them as classifiers, where functional-links are used to select input features.

functional-link neural networks: Firstly, we use functional transfer matrices as alternatives to standard weight matrices, whereas functional expansion modules are NOT alternatives to weight matrices in functional-link neural networks. Secondly, our models have up to 10 hidden layers, and functional transfer

no hidden layer or have only one hidden layer with a linear weight matrix, and functional-links are only used to enhance input patterns [16, 13]. Thirdly, most functional transfer matrices are trainable, whereas most functional-links are fixed (except the random variable functional-link). Finally, functional transfer

memory blocks, and they are different from the applications of functional-links.

7.2. Deep architecture

used to map pixels to a symbol [3, 17], map a frame of voice to a phone [1, 18],

theorem proving [20, 21], manipulate variables in symbolic representations [22], guide the synthesis of programs [23], etc. In the above applications, deep neural networks play a role that maps input signals with different sizes, shapes and structures to vectorised features. Similarly, the first part of our experiments

are used to map pixels to vectorised features.

7.3. The improvement of neurons

originally the logistic function [24]. The logistic function can be replaced with

maxout function, or the soft-maxout function [3, 25, 26]. Also, neurons can be more complicated: For instance, they can be Clifford neurons in hyperconformal space and be used to construct ellipsoidal hypersurfaces [27]. The Clifford neurons are based on the Clifford algebra which extends multiplications

Also, they can be the Taylor series of some simple activation functions, and the Taylor series can become trainable functions, because their coefficients can be trained [29]. Another kind of trainable neurons is called the Hermitian activation function [30]. The Hermitian activation function is applied to feedforward neural

its parameters [31]. Besides, neurons can be functions processing not only real numbers, but also complex numbers, according to the theory of complex back-propagation [32]: For instance, they can form multivalued neurons in feedforward neural networks [33]. They can also be extended to the hyper-

elements with a scalar and three orthogonal vectors [34]. Further, they can be applied to not only fully connected neural networks, but also convolutional neural networks [35]. Different from the above work about neurons, our work focuses on improving the matrices between neurons instead of improving neurons

8. Conclusion and future work

are presented. As argued in Section 3, connections between nodes of a fully connected neural network can be represented by trainable functional

can be extended to the training of these functional connections. Different functional transfer matrices can form different mathematical structures, such as conic hypersurfaces, periodic functions, sleeping units and dead units. Also, multiple functional transfer matrices can form a deep structure called a deep

when the number of hidden layers increases. Practically, the use of layer-wise supervised training and fine-tuning can solve this problem. In the experiments, a wide range of functional transfer matrices are demonstrated to be trainable and be able to gain high test accuracies on the MNIST database. It is also

memorise hundreds of digits of the circumference ratio, which means that the neural network can memorise previous information and form a memory block. There also remain many possibilities worth exploring in the future: Firstly, it is possible to apply functional transfer matrices to other kinds of neural networks,

their weights may also be replaced with functional connections. Secondly, it is often possible to design functional connections for special purposes, as there are an infinite number of possible combinations of functions. Thirdly, the training of deep functional transfer neural networks still relies on the layer-wise approach.

possible to apply functional transfer matrices to more fields besides the two applications in this work. In summary, this article has introduced the concept and shown the viability of changing the structure of neural networks by replacing weights with functional connections, and it is believed that the use of functional

Acknowledgement

the Central Universities [grant number 2016JX06]; and the National Natural Science Foundation of China [grant number 61472369].

References

[1] A. Mohamed, G. E. Dahl, G. E. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio, Speech & Language Processing 20 (1) (2012) 14–22. doi:10.1109/TASL.2011.2109382.

[2] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, Journal of Biomedical Informatics 35 (5-6) (2002) 352–359. doi:10.1016/S1532-0464(03)00034-0.

[3] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, 2011, pp. 315–323.

[4] G. F. Montúfar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information

[5] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.

[6] R. Hecht-Nielsen, Theory of the backpropagation neural network, Neural Networks 1 (Supplement-1) (1988) 445–448. doi:10.1016/0893-6080(88)90469-8. URL https://doi.org/10.1016/0893-6080(88)90469-8

[7] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Process. Mag. 29 (6) (2012) 141–142. doi:10.1109/MSP.2012.2211477. URL https://doi.org/10.1109/MSP.2012.2211477

[8] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., Learning

(1988) 1.

[9] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research

[10] M. Zinkevich, M. Weimer, A. J. Smola, L. Li, Parallelized stochastic gradient descent, in: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver,

[11] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural

[12] S. Wiesler, A. Richard, R. Schlüter, H. Ney, Mean-normalized stochastic gradient for large-scale deep learning, in: IEEE International Conference

May 4-9, 2014, 2014, pp. 180–184. doi:10.1109/ICASSP.2014.6853582. URL http://dx.doi.org/10.1109/ICASSP.2014.6853582

[13] Y. Pao, G. H. Park, D. J. Sobajic, Learning and generalization characteristics of the random vector functional-link net, Neurocomputing

URL https://doi.org/10.1016/0925-2312(94)90053-1

[14] S. Dehuri, S. Cho, A comprehensive survey on functional link neural networks and an adaptive PSO-BP learning for CFLNN, Neural Computing and Applications 19 (2) (2010) 187–205. doi:10.1007/

URL https://doi.org/10.1007/s00521-009-0288-5

[15] H. Zhao, J. Zhang, Functional link neural network cascaded with chebyshev orthogonal polynomial for nonlinear channel equalization, Signal Processing 88 (8) (2008) 1946–1957. doi:10.1016/j.sigpro.2008.01.

URL https://doi.org/10.1016/j.sigpro.2008.01.029

[16] S. Dehuri, S. Cho, Evolutionarily optimized features in functional link neural network for classification, Expert Syst. Appl. 37 (6) (2010) 4379–4391. doi:10.1016/j.eswa.2009.11.090.

[17] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS

URL http://www.jmlr.org/proceedings/papers/v9/glorot10a.html

[18] L. Deng, G. E. Hinton, B. Kingsbury, New types of deep neural network learning for speech recognition and related applications: an overview, in: IEEE International Conference on Acoustics, Speech and Signal Processing,

8603. doi:10.1109/ICASSP.2013.6639344. URL https://doi.org/10.1109/ICASSP.2013.6639344

[19] R. Sarikaya, G. E. Hinton, A. Deoras, Application of deep belief networks for natural language understanding, IEEE/ACM Trans. Audio, Speech &

2303296. URL https://doi.org/10.1109/TASLP.2014.2303296

[20] G. Irving, C. Szegedy, A. A. Alemi, N. Eén, F. Chollet, J. Urban, Deepmath - deep sequence models for premise selection, in: Advances in

Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 2235–2243.

[21] C. Cai, D. Ke, Y. Xu, K. Su, Learning of human-like algebraic reasoning

URL http://arxiv.org/abs/1704.07503

[22] C. Cai, D. Ke, Y. Xu, K. Su, Symbolic manipulation based on deep neural networks and its application to axiom discovery, in: 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK,

7966113. URL https://doi.org/10.1109/IJCNN.2017.7966113

[23] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, D. Tarlow, Deepcoder: Learning to write programs, CoRR abs/1611.01989.

[24] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.

[25] J. Bergstra, G. Desjardins, P. Lamblin, Y. Bengio, Quadratic polynomials learn better image features, Tech. rep., Technical Report 1337,

Montréal (2009).

[26] X. Zhang, J. Trmal, D. Povey, S. Khudanpur, Improving deep neural network acoustic models using generalized maxout networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing,

10.1109/ICASSP.2014.6853589. URL https://doi.org/10.1109/ICASSP.2014.6853589

[27] C. Villaseñor, N. Arana-Daniel, A. Y. Alanis, C. López-Franco, Hyperellipsoidal neuron, in: 2017 International Joint Conference on Neural

788–794. doi:10.1109/IJCNN.2017.7965932. URL https://doi.org/10.1109/IJCNN.2017.7965932

[28] S. Buchholz, G. Sommer, On clifford neurons and clifford multi-layer perceptrons, Neural Networks 21 (7) (2008) 925–935. doi:10.1016/j.

URL https://doi.org/10.1016/j.neunet.2008.03.004

[29] H. Chung, S. J. Lee, J. G. Park, Deep neural network using trainable activation functions, in: 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016, 2016,

[30] S. M. Siniscalchi, T. Svendsen, F. Sorbello, C. Lee, Experimental studies on continuous speech recognition using neural architectures with “adaptive” hidden activation functions, in: Proceedings of the IEEE International

14-19 March 2010, Sheraton Dallas Hotel, Dallas, Texas, USA, 2010, pp. 4882–4885. doi:10.1109/ICASSP.2010.5495120. URL https://doi.org/10.1109/ICASSP.2010.5495120

[31] S. M. Siniscalchi, J. Li, C. Lee, Hermitian polynomial for speaker

Audio, Speech & Language Processing 21 (10) (2013) 2152–2161. doi:10.1109/TASL.2013.2270370. URL https://doi.org/10.1109/TASL.2013.2270370

[32] H. Leung, S. Haykin, The complex backpropagation algorithm, IEEE

134446. URL https://doi.org/10.1109/78.134446

[33] I. N. Aizenberg, C. Moraga, Multilayer feedforward neural network based on multi-valued neurons (MLMVN) and a backpropagation learning

s00500-006-0075-5. URL https://doi.org/10.1007/s00500-006-0075-5

[34] N. Matsui, T. Isokawa, H. Kusamichi, F. Peper, H. Nishimura, Quaternion neural network with geometrical operators, Journal of Intelligent and

URL http://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs00236

[35] Y. Kominami, H. Ogawa, K. Murase, Convolutional neural networks with multi-valued neurons, in: 2017 International Joint Conference on Neural

2673–2678. doi:10.1109/IJCNN.2017.7966183. URL https://doi.org/10.1109/IJCNN.2017.7966183

Appendix

layers respectively. In these tables, accuracies are recorded as x/y, where x denotes an accuracy after layer-wise supervised training, and y denotes an accuracy after fine-tuning. In particular, “” means that a model fails to be trained. To reduce the size of tables, “Activation” is abbreviated to “Act.”, and

Table 6: Full Results on MNIST — 1 Hidden Layer

Table 7: Full Results on MNIST — 5 Hidden Layers

Table 8: Full Results on MNIST — 10 Hidden Layers
