For instance, a fully connected deep neural network usually has a weight matrix in each hidden layer, and the weight matrix, a bias vector and a nonlinear
has been done on combining neural networks with different activations, such as the logistic function [2], the rectifier [3], the maxout function [4] and the long short-term memory blocks [5]. In this work, we study neural networks from another viewpoint: Instead of using different activations, we replace weights in
between nodes are represented by real functions rather than real numbers. Specifically, this work focuses on:
the theory of functional transfer neural networks, including functional transfer matrices and a back-propagation algorithm for functional connections. Section 4 provides some examples for explaining the meanings of functionally connected structures. Section 5 discusses training methods for practical use. Section 6
concludes this work and discusses future work.
networks and only mention concepts and formulae relevant to our research [6]: A feedforward neural network usually consists of an input layer, some hidden layers and an output layer. Two neighbouring layers are connected by a linear weight matrix W, a bias vector b and an activation function. Given a vector x, the neural network can map it to another vector y via:
3.1. Functional transfer matrices
to represent connections between nodes. Formally, it is defined as:
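where each $F_{i,j}(x)$ is a trainable function. We also define a transfer computation “$\odot$” for the functional transfer matrix $F$ and a column vector $v = (v_1, v_2, \ldots, v_N)^T$ such that:

$$F \odot v = \Big(\sum_{j=1}^{N} F_{1,j}(v_j),\ \sum_{j=1}^{N} F_{2,j}(v_j),\ \ldots,\ \sum_{j=1}^{N} F_{M,j}(v_j)\Big)^T$$

Suppose that $b = (b_1, b_2, \ldots, b_M)^T$ is a bias vector, $\sigma(\cdot)$ is an activation, and $x = (x_1, x_2, \ldots, x_N)^T$ is an input vector. An output vector $y = (y_1, y_2, \ldots, y_M)^T$ can then be computed via

$$y = \sigma(F \odot x + b)$$

In other words, the ith element of the output vector is

$$y_i = \sigma\Big(\sum_{j=1}^{N} F_{i,j}(x_j) + b_i\Big)$$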
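The transfer computation can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`transfer`, `layer_forward`, `linear`, `sine`) and all numeric values are hypothetical, and the sine connection is only one possible choice of trainable function. The sketch applies the function in row i, column j to the jth input and sums over j, then adds a bias and applies an activation.

```python
import math

def transfer(matrix, v):
    """Transfer computation: entry i of the result is sum_j matrix[i][j](v[j])."""
    return [sum(f(v_j) for f, v_j in zip(row, v)) for row in matrix]

def layer_forward(matrix, bias, x, activation):
    """One functionally connected layer: activation(transfer(matrix, x) + bias)."""
    return [activation(s + b) for s, b in zip(transfer(matrix, x), bias)]

def linear(w):
    """A connection that behaves like an ordinary real weight w."""
    return lambda x: w * x

def sine(a, w, p):
    """An illustrative non-linear connection a * sin(w * x + p)."""
    return lambda x: a * math.sin(w * x + p)

x = [0.5, -1.0, 2.0]

# With purely linear entries the transfer computation is a matrix-vector product.
weights = [[1.0, 2.0, 3.0], [0.0, -1.0, 0.5]]
linear_matrix = [[linear(w) for w in row] for row in weights]
linear_out = transfer(linear_matrix, x)

# Different functions can be mixed freely inside one 2 x 3 matrix.
mixed_matrix = [
    [sine(1.0, 2.0, 0.0), linear(0.5), sine(0.3, 1.0, 1.0)],
    [linear(-1.0), sine(2.0, 0.5, 0.0), linear(0.0)],
]
y = layer_forward(mixed_matrix, [0.1, -0.2], x, math.tanh)
```

The linear case shows that a standard weight matrix is a special case of a functional transfer matrix in which every entry is a multiplication by a constant.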
3.2. Back-propagation
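vectors and activations to form a multi-layer model, and the model can be trained via back-propagation [6]: Suppose that a model has L hidden layers. For the lth hidden layer, $F^{(l)}$ is a functional transfer matrix, its function $F^{(l)}_{i,j}(x)$ is a real function with an independent variable x and r trainable parameters $p^{(l)}_{i,j,k}$ ($k = 1, 2, \ldots, r$), $b^{(l)} = (b^{(l)}_1, b^{(l)}_2, \ldots, b^{(l)}_M)^T$ is a bias vector, $\sigma(\cdot)$ is an activation, and $x^{(l)} = (x^{(l)}_1, x^{(l)}_2, \ldots, x^{(l)}_N)^T$ is an input signal. An output signal $y^{(l)} = (y^{(l)}_1, y^{(l)}_2, \ldots, y^{(l)}_M)^T$ is computed via:

$$y^{(l)} = \sigma\big(F^{(l)} \odot x^{(l)} + b^{(l)}\big)$$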
The output signal is the input signal of the next layer. In other words, if $l < L$, then $x^{(l+1)} = y^{(l)}$. After an error signal is evaluated by an output layer and an error function (see Section 5.2), the parameters can be updated: Suppose that $e^{(l)} = (e^{(l)}_1, e^{(l)}_2, \ldots, e^{(l)}_M)^T$ is an error signal of output nodes, and $\eta$ is a learning rate. Let
Deltas of the biases are the same as the conventional rule:
An error signal of the (l − 1)th layer is computed via:
Please note that the computation of (
)
can be simplified as a transfer
computation: Let be a derivative matrix in which each element is
Then the computation can be done via . Similarly, the computation of
the transfer computation.
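The chain rule behind these update rules can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's exact formulation: it assumes sine connections a·sin(w·x + p), a tanh activation and a squared-error loss, it works with plain gradients (the sign convention of the error signal in the text may differ), and all names and values are hypothetical. It computes the gradient for one kind of parameter and the signal passed to the previous layer, then compares both with central finite differences.

```python
import math

def forward(a, w, p, b, x):
    """y_i = tanh(sum_j a[i][j] * sin(w[i][j] * x[j] + p[i][j]) + b[i])."""
    return [
        math.tanh(sum(a[i][j] * math.sin(w[i][j] * x[j] + p[i][j])
                      for j in range(len(x))) + b[i])
        for i in range(len(b))
    ]

def backward(a, w, p, b, x, grad_y):
    """Gradients of the loss w.r.t. the angular velocities w and the input x."""
    y = forward(a, w, p, b, x)
    # dE/dz_i for the pre-activation z_i, using tanh'(z) = 1 - tanh(z)^2.
    delta = [g * (1.0 - y_i ** 2) for g, y_i in zip(grad_y, y)]
    rows, cols = len(b), len(x)
    cos = [[math.cos(w[i][j] * x[j] + p[i][j]) for j in range(cols)]
           for i in range(rows)]
    grad_w = [[delta[i] * a[i][j] * cos[i][j] * x[j] for j in range(cols)]
              for i in range(rows)]
    # Signal for the previous layer: a column-wise sum over the derivative
    # matrix whose entries are dF_ij/dx = a_ij * w_ij * cos(...).
    grad_x = [sum(delta[i] * a[i][j] * w[i][j] * cos[i][j] for i in range(rows))
              for j in range(cols)]
    return grad_w, grad_x

a = [[0.8, -0.5, 0.3], [0.2, 0.9, -0.7]]
w = [[1.0, 0.5, -1.5], [2.0, -1.0, 0.4]]
p = [[0.0, 0.3, -0.2], [0.5, -0.4, 0.1]]
b = [0.1, -0.2]
x = [0.6, -0.3, 0.9]
target = [0.5, -0.5]

def loss(w_, x_):
    outputs = forward(a, w_, p, b, x_)
    return 0.5 * sum((y_i - t) ** 2 for y_i, t in zip(outputs, target))

grad_y = [y_i - t for y_i, t in zip(forward(a, w, p, b, x), target)]
grad_w, grad_x = backward(a, w, p, b, x, grad_y)

# Central finite differences for one parameter and one input component.
eps = 1e-6
w_hi = [row[:] for row in w]
w_lo = [row[:] for row in w]
w_hi[0][1] += eps
w_lo[0][1] -= eps
numeric_w = (loss(w_hi, x) - loss(w_lo, x)) / (2 * eps)

x_hi, x_lo = x[:], x[:]
x_hi[2] += eps
x_lo[2] -= eps
numeric_x = (loss(w, x_hi) - loss(w, x_lo)) / (2 * eps)
```

The agreement between the analytic and numeric values indicates that only the derivatives of each connection function with respect to its parameters and its input are needed to extend back-propagation to functional connections.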
4.1. Periodic functions
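instance, $f(x) = A \sin(\omega x + \varphi)$ is a periodic function, where $A$, $\omega$ and $\varphi$ are its amplitude, angular velocity and initial phase respectively. Thus, we can define a matrix consisting of the following function:

$$F_{i,j}(x) = A_{i,j} \sin(\omega_{i,j} x + \varphi_{i,j})$$

where $A_{i,j}$, $\omega_{i,j}$ and $\varphi_{i,j}$ are trainable parameters. If it is an $M \times N$ matrix, then it can transfer an N-dimensional vector $(x_1, x_2, \ldots, x_N)^T$ to an M-dimensional vector $(y_1, y_2, \ldots, y_M)^T$ such that $y_i = \sum_{j=1}^{N} A_{i,j} \sin(\omega_{i,j} x_j + \varphi_{i,j})$. It is noticeable that each $y_i$ is a composition of many periodic functions, and their amplitudes, angular velocities and initial phases can be updated via training. The training process is based on the following partial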
4.2. Modelling of conic hypersurfaces
constructed by ellipses and hyperbolas. Recall the mathematical definition of ellipses: An ellipse with a given centre and two semi-axes can
positive number. Let =
(
. The equation becomes
= 0. It is noticeable that
can be rewritten to
· (
· (
⊙
(See Eq. 3). Therefore, a model with an activation ) can be defined as
y = ξ((x
(x
and y is an output. Based on the above structure, multiple ellipse decision
150)
00)
100)
00)
and
In the above model, and
are input nodes.
and
are hidden nodes. z is an output node.
) is an activation. The input nodes and the hidden nodes are connected via a functional transfer matrix. The hidden nodes and the output node are connected via a weight matrix. The hidden nodes and the output node are activated via biases and the activation. Fig. 1 shows decision boundaries formed by this model: The functional transfer matrix and the hidden nodes form three ellipse boundaries (in orange, green and purple respectively). The weight matrix and the output node form a boundary which is the union of inner parts of all ellipse boundaries. In addition, this figure shows some example inputs and outputs: If the inputs are inside the decision boundaries, the output is 1. Otherwise, the output is 0. The reason why the model can construct ellipse boundaries is that its functional transfer matrix consists of the following function:
where and
are trainable parameters. If the input dimension of this matrix is 2, it will construct ellipse boundaries on a plane. If its input dimension is 3, it will construct ellipsoid boundaries in a 3-dimensional space. Generally, if its input dimension is N (N ≥ 2), it will construct boundaries represented by (N − 1)-dimensional closed hypersurfaces of ellipses in an N-dimensional space. This phenomenon reflects a significant difference between a transfer matrix with Eq. (18) and a standard linear weight matrix, as the former models closed hypersurfaces, while the latter models hyperplanes. In addition, to use the back-propagation algorithm to train the parameters, derivatives are computed via:

Figure 1: Decision boundaries formed by Eq. (15), Eq. (16) and Eq. (17).
also hyperbolas. The adapted function is defined as:
where the additional coefficient is initialised as 1 or -1. Please note that this coefficient is a constant, NOT a trainable parameter. Given a 1 × 2 matrix with this function:
If it is initialised by:
then it can represent an ellipse boundary. On the other hand, if it is initialised
by:
then it can be used to form a hyperbola boundary. More generally, in an N-dimensional space, an M × N functional transfer matrix with Eq. (22) can represent M different (N − 1)-dimensional conic hypersurfaces.
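The role of the sign constant can be illustrated with a short sketch. The functional form used here is an assumption, since the original equations are not reproduced in this text: each connection is taken to be sign · scale · (x − centre)², with a fixed sign of 1 or -1 and a bias of -1, so that the zero set of one row is a conic. The function name `conic_value` and all numbers are hypothetical.

```python
def conic_value(x, scale, centre, sign, bias=-1.0):
    """Value of sum_j sign_j * scale_j * (x_j - centre_j)^2 + bias for one row.

    The boundary is the set of points where this value is zero.
    """
    total = sum(s * k * (x_j - c) ** 2
                for s, k, x_j, c in zip(sign, scale, x, centre))
    return total + bias

# Hypothetical parameters: semi-axes 2 and 1 around the centre (1, 0),
# so each scale is 1 / semi_axis^2.
scale = [1.0 / 2.0 ** 2, 1.0 / 1.0 ** 2]
centre = [1.0, 0.0]

ellipse_sign = [1.0, 1.0]      # both signs positive: a closed ellipse
hyperbola_sign = [1.0, -1.0]   # mixed signs: a hyperbola

on_ellipse = conic_value([3.0, 0.0], scale, centre, ellipse_sign)
inside = conic_value([1.0, 0.5], scale, centre, ellipse_sign)
outside = conic_value([4.0, 0.0], scale, centre, ellipse_sign)
on_hyperbola = conic_value([3.0, 0.0], scale, centre, hyperbola_sign)
```

Under this assumed form, points inside the ellipse give negative values and points outside give positive values, so a threshold-like activation on top of the row separates the two regions.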
4.3. Sleeping weights and dead weights
“sleep”. The reason is that the connection from Node j to Node i is a real weight $w_{i,j}$. Let $x_j$ denote a state of Node j. Node i will receive a signal $w_{i,j} x_j$. When $w_{i,j} \neq 0$ and $x_j \neq 0$, the signal always influences Node i, regardless of whether or not Node i needs it. To enable the connections to “sleep” temporarily, the following function is used in a functional transfer matrix:
In the above function, and
are trainable parameters, and
is a constant which is initialised as 1 or -1 before training. It also makes use of the rectifier [3], $\mathrm{relu}(x) = \max(0, x)$. The rectifier enables the function to “sleep”. In other words, when the input of the rectifier is not positive, the function must be zero. To use the back-propagation algorithm
to train the parameters, derivatives are computed via:
word “die” is different from the word “sleep”, as the former means that the function equals 0 for all x, and the latter means that the function equals 0 for some x. If the connection from Node j to Node i is dead, then any change of Node j does not influence Node i. This function is:
In the above function, is a trainable parameter, and
is a constant which is initialised as 1 or -1 before training. To use the back-propagation algorithm to train the parameters, derivatives are computed via:
It is noticeable that ),
means that the function does not transfer any signal and cannot be updated in
this case. In other words, the function is dead once its trainable parameter is updated to a negative value. A matrix with this function can then represent a partially connected neural network, as dead functions can be considered as broken connections after training. A problem with this function is that it is not able to “revive”. In other words, once the parameter is updated to a negative value, it has no chance to become non-negative again. To solve this problem, an adapted version of the rectifier is used:
For the computation of
We use “” instead of “=” because this derivative is not a mathematically sound result, but a predefined value. Thus, derivatives of
) are computed
It is noticeable that the trainable parameter can be updated when it is negative. Thus, although the function is dead when the parameter is negative, it can be revived by updating the parameter to a non-negative value.
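The difference between sleeping, dead and revivable connections can be sketched as follows. The exact functions are not reproduced in this text, so the forms below are assumptions: a sleeping connection is taken to be c · relu(a · x + b), a dead-able connection c · relu(a) · x, and the adapted rectifier is assumed to use a predefined derivative of 1 everywhere. All names and values are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def sleeping(x, a, b, c):
    """Assumed sleeping connection: zero whenever a * x + b <= 0."""
    return c * relu(a * x + b)

def dead(x, a, c):
    """Assumed dead-able connection: zero for every x once a is negative."""
    return c * relu(a) * x

def grad_a_plain(x, a, c, upstream):
    # True derivative of relu(a): zero for negative a, so a can never recover.
    return upstream * c * x * (1.0 if a > 0 else 0.0)

def grad_a_revivable(x, a, c, upstream):
    # Predefined derivative of 1 for the rectifier, even when a is negative.
    return upstream * c * x * 1.0

a, c, learning_rate = -0.05, 1.0, 0.1
x, upstream = 1.0, -1.0  # an upstream gradient that asks for a larger output

a_plain = a - learning_rate * grad_a_plain(x, a, c, upstream)
a_revived = a - learning_rate * grad_a_revivable(x, a, c, upstream)
```

With the true derivative the negative parameter stays where it is, while the predefined derivative lets a gradient step move it back above zero, after which the connection transfers signals again.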
4.4. Sequential Modelling via Memory Functions
sequences. For instance, given a sequential input , the tth
state (t = 1) of a memory function
) is computed via:
where and
are trainable parameters. In particular,
is a memory cell and its initial state
is set to 0. To use the back-propagation algorithm to train the parameters, derivatives are computed via:
signals from a previous state, as each memory function records one signal. On the other hand, a standard recurrent neural network usually uses hidden units to record signals from the previous state. If it has N input units and M hidden units, then it can record M signals. It is noticeable that the functional transfer
recurrent neural network.
neural networks. These methods are also used in the experiments (Section 6).
Figure 2: The general structure of functional transfer neural networks.
5.1. The model structure and initialisation
They usually have an input layer, one or more hidden layers and an output layer. Each hidden layer consists of a functional transfer matrix and some hidden units with a bias vector and an activation function. In particular, the activation function can be the logistic sigmoid function [3]:

$$\mathrm{logistic}(x) = \frac{1}{1 + e^{-x}}$$

the rectified linear unit (ReLU)

$$\mathrm{relu}(x) = \max(0, x)$$

or the hyperbolic tangent (tanh) function

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

The output layer consists of a linear weight matrix and N output units with a bias vector and a softmax function:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}}$$
initialised based on the following method:
5.2. Training a single hidden layer
layer, and an output signal is generated. Then an error signal is computed by comparing the output signal with a target signal. The comparison is based on the cross-entropy criterion [9]. Next, the error signal is back-propagated through the output layer and the hidden layer, and deltas of parameters are computed.
the rules described by Section 3.2 are used. Finally, the parameters are updated according to their deltas. The above process can be combined with stochastic gradient descent [10].
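The error signal at the softmax output layer can be sketched as follows. This is a generic illustration of the cross-entropy criterion, not the authors' code; the function names are hypothetical, and the sign convention of the error signal in the text may differ from the plain gradient used here.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of pre-activations."""
    largest = max(z)
    exps = [math.exp(v - largest) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def output_error(probs, target):
    """Gradient of the cross-entropy w.r.t. the softmax pre-activations."""
    return [p - t for p, t in zip(probs, target)]

z = [1.0, 2.0, 0.5]
target = [0.0, 1.0, 0.0]
probs = softmax(z)
error = output_error(probs, target)

# Finite-difference check of the first component of the gradient.
eps = 1e-6
z_hi = [z[0] + eps, z[1], z[2]]
z_lo = [z[0] - eps, z[1], z[2]]
numeric = (cross_entropy(softmax(z_hi), target)
           - cross_entropy(softmax(z_lo), target)) / (2 * eps)
```

This gradient is what gets passed back through the linear weight matrix of the output layer and then into the hidden layer.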
5.3. Layer-wise supervised training and fine-tuning
number of hidden layers increases. To resolve this problem, the combination
Figure 3: Layer-wise supervised training and fine-tuning. Functional connections are denoted by solid lines, and linear weight connections are denoted by dashed lines.
of layer-wise supervised training and fine-tuning is used [11], as shown in Fig. 3. Layer-wise training includes the following steps: Firstly, the first hidden layer is trained, while the other hidden layers and the output layer are
and a bias vector) is added onto it, and the method for training a single hidden layer (described by Section 5.2) is used to train it. Then the softmax layer is removed, and the second hidden layer is added onto the first hidden layer. To train the second hidden layer, another new softmax layer is added onto it, and
is not updated in this step. Finally, all remaining hidden layers are trained by using the same method. In particular, the trained softmax layer on the last hidden layer is considered as the output layer of the whole neural network. After layer-wise training, the whole neural network can be further optimised via
For the hidden layers, the rules described by Section 3.2 are used. In practice, learning rates for fine-tuning are smaller than those for layer-wise training.
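The order of the training stages can be summarised in a short sketch. It only lists which layers are updated and which are kept fixed at each stage, following the description above; the function name `layerwise_schedule` and the layer labels are hypothetical, and no actual training is performed.

```python
def layerwise_schedule(num_hidden_layers):
    """List the stages of layer-wise supervised training followed by fine-tuning.

    Each stage names the layers that are updated and the layers that are
    present but frozen. A fresh softmax layer is used at every layer-wise
    stage, and the last one is kept as the output layer.
    """
    stages = []
    for k in range(1, num_hidden_layers + 1):
        stages.append({
            "phase": "layer-wise",
            "frozen": [f"hidden{i}" for i in range(1, k)],
            "trained": [f"hidden{k}", f"softmax{k}"],
        })
    all_hidden = [f"hidden{i}" for i in range(1, num_hidden_layers + 1)]
    stages.append({
        "phase": "fine-tuning",
        "frozen": [],
        "trained": all_hidden + [f"softmax{num_hidden_layers}"],
    })
    return stages

stages = layerwise_schedule(3)
```

For three hidden layers this yields three layer-wise stages and one fine-tuning stage in which every layer is updated, typically with a smaller learning rate.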
6.1. MNIST handwritten digit recognition
6.1.1. Experimental settings
transfer neural networks. Each function in the functional transfer matrix (see Eq. (2)) is replaced by one of these example functions. In particular,
and
are used to denote trainable parameters instead of using
. These functions
Section 4.4: F06, F07, and F08 have been discussed in Section 4.3; F12 has been discussed in Section 4.1; F19 and F20 have been discussed in Section 4.2. They also include functions which have not been discussed before, such as F01 - F05, F09 - F11 and F13 - F18. We test these examples to explore the
be useful, but it is difficult to cover all possible functions in this paper. On the other hand, some kinds of functions may not work. For instance, we have found that ) =
) does not work, because the range of f(x) = cosh(x) is [1, +∞), which means that this function does not have a
range to [0, +∞), as indicated by F15. For another instance, f(x) = tan(x) does not work, because it is a discontinuous function. This is the reason why F12 cannot be changed to
) =
). For the same reason, some other discontinuous functions, including f(x) = cot(x), f(x) = sec(x),
used in functional networks.
Section 5.1. In particular, the input layer has 784 nodes, and the output layer has 10 nodes. They correspond to input pixels and output labels of the MNIST
layer has 128 hidden units with the logistic activation, the ReLU or the tanh activation. Parameters of functional transfer matrices in these hidden layers
Table 1: Example Functions Used to Model Functional Networks.
Table 2: Initialisation of Transfer Matrices.
are initialised based on Table 2. All training processes make use of the layer-wise supervised training and fine-tuning strategy discussed in Section 5.3. In
and the learning rate is set to a constant 2, where
. For fine-tuning, the Newbob+/Train strategy is used [12]: The learning rate is set to 2
initially, and it will be halved if the improvement of the cross-entropy loss is smaller than 10
. The training stops when the improvement is smaller
fine-tuning, the size of mini-batches is set to 16.
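The halving rule for fine-tuning can be sketched as follows. This follows only the description given here (halve the rate when the improvement is small, stop when it is smaller still) and is not a reproduction of the scheme in [12]; the function name, the loss values and both thresholds are hypothetical, since the actual exponents are not recoverable from this text.

```python
def halving_schedule(losses, initial_lr, halve_below, stop_below):
    """Replay per-epoch losses and return the learning rate after each check.

    losses      -- cross-entropy loss measured after each epoch
    initial_lr  -- learning rate at the start of training
    halve_below -- halve the rate when the improvement falls below this value
    stop_below  -- stop when the improvement falls below this smaller value
    """
    lr = initial_lr
    rates = [lr]
    for previous, current in zip(losses, losses[1:]):
        improvement = previous - current
        if improvement < stop_below:
            break
        if improvement < halve_below:
            lr /= 2.0
        rates.append(lr)
    return rates

# Hypothetical losses and thresholds, chosen only to exercise both rules.
rates = halving_schedule(
    losses=[1.0, 0.5, 0.45, 0.30, 0.2999],
    initial_lr=0.25,
    halve_below=0.1,
    stop_below=0.001,
)
```

In this example the rate is halved once, after the second comparison, and training stops at the last comparison because the improvement has almost vanished.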
6.1.2. Results
include accuracies after layer-wise training (LWT), accuracies after fine-tuning
In addition, F01 - F20 are IDs of functional transfer matrices, and Logistic,
Table 3: Accuracies (%) and the Best — 1 Hidden Layer.
ReLU and Tanh are activations of hidden units. It is noticeable that most models are trained successfully, except the F11-ReLU model. Although this model fails to be trained, its counterparts, including the F11-Logistic model
best is different for different functions and activations, which means that the initial learning rate should be carefully selected for different models.
8 models failed to be trained, which means that training is becoming more
are 52 models that were trained successfully, which demonstrates that the revised back-propagation rules for functional transfer matrices work for deep models. Among these models, 14 models obtain a better accuracy (after fine-tuning) than their counterparts in Table 3, whereas 38 models obtain a worse
better than fewer hidden layers, and vice versa. In addition, for a certain function, the best for ReLU is usually smaller than that for Logistic, and the best
for Tanh is always smaller than that for Logistic. A similar phenomenon about
has also appeared in Table 3.
are 11 models that failed to be trained and 49 models that were trained successfully, which means that the functional transfer matrices can still work when there are up to 10 hidden layers, but training has become more difficult. Among the models trained successfully, 12 models obtain a better accuracy (after fine-tuning) than
obtains the same accuracy. These results show again that increasing the number of hidden layers can (but does not always) bring about a worse accuracy. In addition, a comparison among the values in Table 3, Table 4 and Table 5 reveals that the best
values are influenced by the functional transfer matrices, the activations
Table 4: Accuracies (%) and the Best — 5 Hidden Layers.
Table 5: Accuracies (%) and the Best — 10 Hidden Layers.
Therefore, tuning of the initial learning rate is still required in practice.
6.2. Memorising the digits of π
6.2.1. Experimental settings
evaluated by using the same methodology as the previous experiments. In this part, we explore whether functional transfer neural networks with the memory function are able to model the digit sequence of the circumference ratio π = 3.141592653589793238
. Specifically, given the first
1 digits, the neural
input layer, a hidden layer and an output layer: The input layer has 10 nodes which correspond to the digits 0–9. The hidden layer has 128, 256, 512 or 1024 nodes, and it consists of a functional transfer matrix with the memory function, a bias vector and a logistic activation. The output layer has 10
matrix, a bias vector and a softmax activation. All trainable parameters in the two matrices are initialised as a real number in [1]. All biases are initialised as zero. The training method described by Section 5.2 is used to train the models. Each training epoch makes use of a sequence of training pairs
hot encodings of the ith and (i + 1)th digits of π respectively.
is an input, and
is a target of output. D is set to 200, 400 or 800. The number of training epochs is set to 5000. The learning rate is set to 2
, and it is not changed during the whole training process. Model performance is evaluated by
Figure 4: Learning curves of functional transfer neural networks with the memory function.
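The construction of the training pairs can be sketched as follows. The code uses only the digits quoted in the text, so it builds far fewer pairs than the 200, 400 or 800 used in the experiments; the names `one_hot` and `training_pairs` are hypothetical.

```python
PI_DIGITS = "3141592653589793238"  # digits quoted in the text, decimal point removed

def one_hot(digit, size=10):
    """One-hot encoding of a digit 0-9 as a list of floats."""
    vec = [0.0] * size
    vec[digit] = 1.0
    return vec

def training_pairs(digits, num_pairs):
    """Return (inputs, targets): encodings of the ith and (i + 1)th digits."""
    values = [int(ch) for ch in digits[: num_pairs + 1]]
    inputs = [one_hot(d) for d in values[:-1]]
    targets = [one_hot(d) for d in values[1:]]
    return inputs, targets

inputs, targets = training_pairs(PI_DIGITS, num_pairs=8)
```

Because each target is simply the next input, the sequence has to be presented in order so that the memory cells can carry information from earlier digits.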
6.2.2. Results
model is denoted by “xH-yD”, where x is the number of hidden units, and y the number of training pairs. It is noticeable that all 200D models gain high
and 1024H-400D models gain high accuracies after the first 2500 training epochs, which means that matrices with the memory function are able to model the sequence of digits. However, the 128H-400D model does not gain an accuracy as high as its counterparts, which means that its memory is restricted by the
of hidden units should increase, because it determines the size of the functional transfer matrix and consequently the number of memory functions. For the 800D models, it is clear that increasing the number of hidden units brings about better accuracies, though the accuracies are not particularly high.
deep neural networks and neurons are reviewed, and comparisons between them
and our work are carried out.
7.1. Functional-link neural networks
patterns [13]. Usually, they consist of a functional expansion module, an input layer and an output layer, and the functional expansion module consists of many functional-links: For instance, a functional-link can be a trainable linear function (which is called a random variable functional-link), a multiplication
Chebyshev polynomial basis function [14]. A typical application of functional-link neural networks is to model nonlinear decision boundaries in channel equalisers [15]. Dehuri and Cho [16] also use them as classifiers, where functional-links are used to select input features.
functional-link neural networks: Firstly, we use functional transfer matrices as alternatives to standard weight matrices, whereas functional expansion modules are NOT alternatives to weight matrices in functional-link neural networks. Secondly, our models have up to 10 hidden layers, and functional transfer
no hidden layer or have only one hidden layer with a linear weight matrix, and functional-links are only used to enhance input patterns [16, 13]. Thirdly, most functional transfer matrices are trainable, whereas most functional-links are fixed (except the random variable functional-link). Finally, functional transfer
memory blocks, and they are different from the applications of functional-links.
7.2. Deep architecture
used to map pixels to a symbol [3, 17], map a frame of voice to a phone [1, 18],
theorem proving [20, 21], manipulate variables in symbolic representations [22], guide the synthesis of programs [23], etc. In the above applications, deep neural networks play a role that maps input signals with different sizes, shapes and structures to vectorised features. Similarly, the first part of our experiments
are used to map pixels to vectorised features.
7.3. The improvement of neurons
originally the logistic function [24]. The logistic function can be replaced with
maxout function, or the soft-maxout function [3, 25, 26]. Also, neurons can be more complicated: For instance, they can be Clifford neurons in hyperconformal space and be used to construct ellipsoidal hypersurfaces [27]. The Clifford neurons are based on the Clifford algebra which extends multiplications
Also, they can be the Taylor series of some simple activation functions, and the Taylor series can become trainable functions, because their coefficients can be trained [29]. Another kind of trainable neurons is called the Hermitian activation function [30]. The Hermitian activation function is applied to feedforward neural
its parameters [31]. Besides, neurons can be functions processing not only real numbers, but also complex numbers, according to the theory of complex back-propagation [32]: For instance, they can form multivalued neurons in feedforward neural networks [33]. They can also be extended to the hyper-
elements with a scalar and three orthogonal vectors [34]. Further, they can be applied to not only fully connected neural networks, but also convolutional neural networks [35]. Different from the above work about neurons, our work focuses on improving the matrices between neurons instead of improving neurons
are presented. As argued in Section 3, connections between nodes of a fully connected neural network can be represented by trainable functional
can be extended to the training of these functional connections. Different functional transfer matrices can form different mathematical structures, such as conic hypersurfaces, periodic functions, sleeping units and dead units. Also, multiple functional transfer matrices can form a deep structure called a deep
when the number of hidden layers increases. Practically, the use of layer-wise supervised training and fine-tuning can solve this problem. In the experiments, a wide range of functional transfer matrices are demonstrated to be trainable and be able to gain high test accuracies on the MNIST database. It is also
memorise hundreds of digits of the circumference ratio, which means that the neural network can memorise previous information and form a memory block. There also remain many possibilities worth exploring in the future: Firstly, it is possible to apply functional transfer matrices to other kinds of neural networks,
their weights may also be replaced with functional connections. Secondly, it is often possible to design functional connections for special purposes, as there are an infinite number of possible combinations of functions. Thirdly, the training of deep functional transfer neural networks still relies on the layer-wise approach.
possible to apply functional transfer matrices to more fields besides the two applications in this work. In summary, this article has introduced the concept and shown the viability of changing the structure of neural networks by replacing weights with functional connections, and it is believed that the use of functional
the Central Universities [grant number 2016JX06]; and the National Natural Science Foundation of China [grant number 61472369].
[1] A. Mohamed, G. E. Dahl, G. E. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio, Speech & Language Processing 20 (1) (2012) 14–22. doi:10.1109/TASL.2011.2109382.
[2] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, Journal of Biomedical Informatics 35 (5-6) (2002) 352–359. doi:10.1016/S1532-0464(03)00034-0.
[3] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, 2011, pp. 315–323.
[4] G. F. Montúfar, R. Pascanu, K. Cho, Y. Bengio, On the number of linear regions of deep neural networks, in: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information
[5] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
[6] R. Hecht-Nielsen, Theory of the backpropagation neural network, Neural Networks 1 (Supplement-1) (1988) 445–448. doi:10.1016/0893-6080(88)90469-8. URL https://doi.org/10.1016/0893-6080(88)90469-8
[7] L. Deng, The MNIST database of handwritten digit images for machine learning research [best of the web], IEEE Signal Process. Mag. 29 (6) (2012) 141–142. doi:10.1109/MSP.2012.2211477. URL https://doi.org/10.1109/MSP.2012.2211477
[8] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., Learning
(1988) 1.
[9] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., Deep neural networks for acoustic modeling in speech recognition: The shared views of four research
[10] M. Zinkevich, M. Weimer, A. J. Smola, L. Li, Parallelized stochastic gradient descent, in: Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver,
[11] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural
[12] S. Wiesler, A. Richard, R. Schlüter, H. Ney, Mean-normalized stochastic gradient for large-scale deep learning, in: IEEE International Conference
May 4-9, 2014, 2014, pp. 180–184. doi:10.1109/ICASSP.2014.6853582. URL http://dx.doi.org/10.1109/ICASSP.2014.6853582
[13] Y. Pao, G. H. Park, D. J. Sobajic, Learning and generalization characteristics of the random vector functional-link net, Neurocomputing
URL https://doi.org/10.1016/0925-2312(94)90053-1
[14] S. Dehuri, S. Cho, A comprehensive survey on functional link neural networks and an adaptive PSO-BP learning for CFLNN, Neural Computing and Applications 19 (2) (2010) 187–205. doi:10.1007/
URL https://doi.org/10.1007/s00521-009-0288-5
[15] H. Zhao, J. Zhang, Functional link neural network cascaded with chebyshev orthogonal polynomial for nonlinear channel equalization, Signal Processing 88 (8) (2008) 1946–1957. doi:10.1016/j.sigpro.2008.01.
URL https://doi.org/10.1016/j.sigpro.2008.01.029
[16] S. Dehuri, S. Cho, Evolutionarily optimized features in functional link neural network for classification, Expert Syst. Appl. 37 (6) (2010) 4379–4391. doi:10.1016/j.eswa.2009.11.090.
[17] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS
URL http://www.jmlr.org/proceedings/papers/v9/glorot10a.html
[18] L. Deng, G. E. Hinton, B. Kingsbury, New types of deep neural network learning for speech recognition and related applications: an overview, in: IEEE International Conference on Acoustics, Speech and Signal Processing,
8603. doi:10.1109/ICASSP.2013.6639344. URL https://doi.org/10.1109/ICASSP.2013.6639344
[19] R. Sarikaya, G. E. Hinton, A. Deoras, Application of deep belief networks for natural language understanding, IEEE/ACM Trans. Audio, Speech &
2303296. URL https://doi.org/10.1109/TASLP.2014.2303296
[20] G. Irving, C. Szegedy, A. A. Alemi, N. Eén, F. Chollet, J. Urban, Deepmath - deep sequence models for premise selection, in: Advances in
Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, 2016, pp. 2235–2243.
[21] C. Cai, D. Ke, Y. Xu, K. Su, Learning of human-like algebraic reasoning
URL http://arxiv.org/abs/1704.07503
[22] C. Cai, D. Ke, Y. Xu, K. Su, Symbolic manipulation based on deep neural networks and its application to axiom discovery, in: 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK,
7966113. URL https://doi.org/10.1109/IJCNN.2017.7966113
[23] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, D. Tarlow, Deepcoder: Learning to write programs, CoRR abs/1611.01989.
[24] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
[25] J. Bergstra, G. Desjardins, P. Lamblin, Y. Bengio, Quadratic polynomials learn better image features, Tech. rep., Technical Report 1337,
Montréal (2009).
[26] X. Zhang, J. Trmal, D. Povey, S. Khudanpur, Improving deep neural network acoustic models using generalized maxout networks, in: IEEE International Conference on Acoustics, Speech and Signal Processing,
10.1109/ICASSP.2014.6853589. URL https://doi.org/10.1109/ICASSP.2014.6853589
[27] C. Villaseñor, N. Arana-Daniel, A. Y. Alanis, C. López-Franco, Hyperellipsoidal neuron, in: 2017 International Joint Conference on Neural
788–794. doi:10.1109/IJCNN.2017.7965932. URL https://doi.org/10.1109/IJCNN.2017.7965932
[28] S. Buchholz, G. Sommer, On clifford neurons and clifford multi-layer perceptrons, Neural Networks 21 (7) (2008) 925–935. doi:10.1016/j.
URL https://doi.org/10.1016/j.neunet.2008.03.004
[29] H. Chung, S. J. Lee, J. G. Park, Deep neural network using trainable activation functions, in: 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016, 2016,
[30] S. M. Siniscalchi, T. Svendsen, F. Sorbello, C. Lee, Experimental studies on continuous speech recognition using neural architectures with “adaptive” hidden activation functions, in: Proceedings of the IEEE International
14-19 March 2010, Sheraton Dallas Hotel, Dallas, Texas, USA, 2010, pp. 4882–4885. doi:10.1109/ICASSP.2010.5495120. URL https://doi.org/10.1109/ICASSP.2010.5495120
[31] S. M. Siniscalchi, J. Li, C. Lee, Hermitian polynomial for speaker
Audio, Speech & Language Processing 21 (10) (2013) 2152–2161. doi:10.1109/TASL.2013.2270370. URL https://doi.org/10.1109/TASL.2013.2270370
[32] H. Leung, S. Haykin, The complex backpropagation algorithm, IEEE
134446. URL https://doi.org/10.1109/78.134446
[33] I. N. Aizenberg, C. Moraga, Multilayer feedforward neural network based on multi-valued neurons (MLMVN) and a backpropagation learning
s00500-006-0075-5. URL https://doi.org/10.1007/s00500-006-0075-5
[34] N. Matsui, T. Isokawa, H. Kusamichi, F. Peper, H. Nishimura, Quaternion neural network with geometrical operators, Journal of Intelligent and
URL http://content.iospress.com/articles/journal-of-intelligent-and-fuzzy-systems/ifs00236
[35] Y. Kominami, H. Ogawa, K. Murase, Convolutional neural networks with multi-valued neurons, in: 2017 International Joint Conference on Neural
2673–2678. doi:10.1109/IJCNN.2017.7966183. URL https://doi.org/10.1109/IJCNN.2017.7966183
layers respectively. In these tables, accuracies are recorded as x/y, where x denotes an accuracy after layer-wise supervised training, and y denotes an accuracy after fine-tuning. In particular, “” means that a model fails to be trained. To reduce the size of tables, “Activation” is abbreviated to “Act.”, and
Table 6: Full Results on MNIST — 1 Hidden Layer
Table 7: Full Results on MNIST — 5 Hidden Layers
Table 8: Full Results on MNIST — 10 Hidden Layers