Convolutional neural networks (CNNs) [1] and back propagation (BP) learning [2] are the most powerful combination of recent machine learning methods. However, when these learning methods are likened to information processing in the brain, several objections are raised, mainly directed at the BP of error information [3]. Meanwhile, the brain is a more powerful learning machine than any current deep learning system: it scales computation to a very large network and simultaneously achieves strong semi-supervised learning. Therefore, clarifying the operational principles of the brain is important for realizing a superior learning machine.
In recent years, biologically-motivated methods have mostly been studied in settings where learning is performed by estimating BP errors from other feedback signals rather than using the BP errors themselves [4–9]. However, because these methods are still supervised and require a considerable amount of labeled data, they cannot realize brain-like computing. To solve this problem, a form of unsupervised learning that does not use any BP information is required.
Numerous studies on unsupervised learning of visual representation have been conducted. It was initially studied using analytical methods. Some previous studies applied independent component analysis to obtain visual bases from natural scene images [10, 11]. This method was subsequently expanded to a neural network (NN)-like regime [12]. Because these methods were vastly different from conventional NNs that were trained using BP learning, the unification of neural and nonneural mechanisms has been challenging.
By contrast, the Boltzmann machine and its variants, which have an NN-like architecture, were applied for pre-training to initialize weight parameters [13, 14]. However, the structure and dynamics of the Boltzmann machine and a feedforward NN are completely different. Thus, converting the representation learned by the Boltzmann machine into a target feedforward NN is somewhat inefficient.
The autoencoder [15] might address the conversion problem; it is a variation of the feedforward NN, and its weight parameters are wholly applicable to a traditional feedforward NN. However, CNNs require additional processes because of the non-linearity of the deconvolution processes. Thus, generator-based unsupervised learning is used for the smooth transfer of representation learning [16]. Goroshin et al. [17] used a sophisticated approach that transfers the visual bases directly into the sub-network of the target CNN using static object information in movies. However, these networks still rely on back-propagated signals, so learning methods that do not depend on back-propagated signals are still needed.
Competitive learning is a method of unsupervised learning that learns by a simple sparse mechanism called "winner takes all" (WTA). Competitive learning does not use any BP information, and instead simply uses local feedforward information. It has been used in networks termed self-organizing maps (SOMs) [18] and Neocognitron [19], and recently it has attracted attention as a plausible biological learning method [20–22]. However, the conventional method is not always compatible with CNNs, which form the basic structure of present deep neural networks (DNNs). In this study, I propose a novel method to apply competitive learning to a multilayer CNN.
One of the most important mechanisms supporting the recent development of CNNs is the rectified linear unit (ReLU) activation function, which realizes efficient sparsification in combination with BP learning. However, it does not work sufficiently without BP signals. Therefore, I propose a method that uses WTA as an activation function, which enables sparse interlayer propagation without BP information. WTA as an activation function has so far been studied mainly in autoencoders [23, 24]. In this study, I combine the WTA activation function with competitive learning, which also uses WTA dynamics, and realize unsupervised learning without BP signals in a DNN.
The proposed method was evaluated by an image discrimination task and demonstrated a drastic acceleration in the initial learning speed. Further, it achieved a state-of-the-art performance as a biologically-motivated method for the top-5 test error categories in the ImageNet experiment. The results suggested that this method obtains higher-level learning representation from unlabeled data and is usable for the following gradient-based fine tuning.
2.1 Competitive Learning
The proposed method was based on traditional competitive learning [25, 26], which is used in Neocognitron [19] and SOMs [18]. I used the simplest competitive learning method, i.e., the WTA algorithm, and did not use any position information over the filter axis. The WTA algorithm is only applied to the weight parameter update process and has no effect on the feedforward signal. The WTA algorithm was performed for the input vector of the l-th layer, which is described as $u^{l} = W^{l} z^{l-1}$, where $z^{l-1}$ is the output vector of the previous layer and $W^{l}$ is the connection matrix. If the activation function is a monotone increasing function, the unit with the maximum input value is the one with the maximum output value. I termed it the "winner" unit and performed a weight update of the competitive learning only for that unit. Therefore, the weight gradient of the i-th unit of the l-th layer was described as follows:
where is the learning coefficient of competitive learning, and was set to
. The weight vector was normalized by L2-norm at every update to stabilize the competitive learning [18]:
I also introduced the conscience factor proposed by DeSieno [27] in the winner processing to improve the learning efficiency for large networks. It balances the winning ratio among units and prevents a few units from becoming dominant. It gives units that still hold their initial noise pattern a chance to win and enables equal learning for all units. The conscience factor in this study was described as follows:
where C is the constant of conscience, and I empirically determined it to be 5.0 for the whole network. N is the number of units of the target layer, and is the probability of winning for the i-th unit in the minibatch. The final version of the weight gradient of competitive learning was described as follows:
Competitive learning is treated as unsupervised pre-training. First, it extracts the basis (e.g., Gabor patches for natural scene images and harmonics for audio) from unlabeled input data. Then, gradient-based learning is applied only to the last fully-connected layer as a fine-tuning using the learning representation obtained through competitive learning. It is important to note that the competitive learning is applied to the entire network at once, which is fundamentally different from the conventional autoencoders which mostly behave in a layer-wise manner.
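The update described in this section can be illustrated with a minimal NumPy sketch for a single fully-connected layer. It is not the author's implementation: the winner is assumed to move toward the input, the conscience bias is assumed to take the form C(1/N - p_i) in the spirit of DeSieno [27], and the learning coefficient `lr` is a placeholder value. The function names `l2_normalize_rows` and `competitive_step` are hypothetical; only C = 5.0 is taken from the text.

```python
import numpy as np


def l2_normalize_rows(w, eps=1e-12):
    """Scale every row (one unit's weight vector) to unit L2 norm."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, eps)


def competitive_step(w, x, win_counts, lr=0.01, conscience=5.0):
    """One winner-takes-all update for a single input vector.

    w          : (n_units, n_inputs) float weight matrix, rows of unit norm
    x          : (n_inputs,) input vector from the previous layer
    win_counts : (n_units,) number of wins so far in the current minibatch
    Returns the index of the winning unit.
    """
    n_units = w.shape[0]
    total = win_counts.sum()
    # Estimated winning probability; uniform before any win is recorded.
    p = win_counts / total if total > 0 else np.full(n_units, 1.0 / n_units)
    # Assumed conscience bias: units that win too often are handicapped.
    bias = conscience * (1.0 / n_units - p)
    winner = int(np.argmax(w @ x + bias))
    # Only the winner's weights are updated (assumed form: move toward x).
    w[winner] += lr * (x - w[winner])
    # Re-normalize the updated weight vector to unit L2 norm.
    w[winner] /= max(np.linalg.norm(w[winner]), 1e-12)
    win_counts[winner] += 1
    return winner


rng = np.random.default_rng(0)
w = l2_normalize_rows(rng.normal(size=(4, 8)))
counts = np.zeros(4)
for x in rng.normal(size=(16, 8)):
    competitive_step(w, x / np.linalg.norm(x), counts)
print(counts)  # how often each of the four units won
```

Because the bias depends only on the running win counts, it affects which unit is selected as the winner but not the feedforward signal, which matches the description that WTA in competitive learning acts only on the weight update.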
2.2 WTA as an activation function
In recent neural networks, sparse dynamics are essential for DNNs, and ReLU is the key to them. Because the sparse dynamics of ReLU are strongly controlled by bias factors, which are modulated by BP learning, a different form of sparse dynamics was required for the proposed method. WTA is one of the simplest activation functions for sparseness. Thus, instead of ReLU, I employed WTA again as an activation function for the convolutional layers. The WTA algorithm for each feature map functioned similarly to a functional column in the biological brain. Each location in each channel has a corresponding winner. For example, an output with 5 x 5 pixels in a single channel has 25 winners, and is a bundle of 25 one-hot vectors.
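As an illustration of the 5 x 5 example above, the following NumPy sketch applies WTA across the channel axis at every spatial location. It assumes that the competition runs over channels and that the output is strictly one-hot (the winner's value is replaced by 1); the function name `wta_activation` is hypothetical.

```python
import numpy as np


def wta_activation(u):
    """Winner-takes-all across channels at every spatial location.

    u : (channels, height, width) pre-activation feature maps
    Returns an array of the same shape holding a one-hot vector over the
    channel axis at each (row, column) position.
    """
    winners = np.argmax(u, axis=0)            # (height, width) channel indices
    out = np.zeros_like(u, dtype=np.float32)
    rows, cols = np.indices(winners.shape)
    out[winners, rows, cols] = 1.0            # mark the winning channel
    return out


u = np.random.default_rng(0).normal(size=(8, 5, 5))
z = wta_activation(u)
print(z.sum())  # 25.0: one winner per spatial location
```

A variant that keeps the winner's input value instead of 1 would only change the assignment line; which of the two the original code uses is not stated in the text.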
To investigate the proposed method's adaptation to a difficult dataset, I employed the ImageNet dataset [28] for verification. Many potential biological models have not been tested on a dataset possessing this level of complexity. The results were compared with the few previous studies that tackled the problem [3]. I also evaluated two simpler image datasets: MNIST [29] and CIFAR-10 [30]. Their results are described in the appendix.
I employed AlexNet [1] as the baseline, and modified it for the application of the proposed hierarchical competitive learning. Fig.A.1(c) shows the network structure for the ImageNet task. I reduced the number of layers with information for the same spatial resolution, and increased the number of channels of spatial frequency in the respective convolutional layers. The resulting network consisted of three convolutional layers and one fully-connected layer. The first convolutional layer comprised 256 filters with 3 x 3 pixels. The second and third layers possessed an Inception-like [31] structure to avoid obtaining only low frequency filters, and the layers consisted of 1024 filters with 3 x 3 and 5 x 5 pixels. The convolutional layers were accompanied by the WTA activation function and 4 x 4 maxpooling with a 2-pixel stride. The fully-connected layer had 1000 units corresponding to the number of labels of ImageNet.
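The effect of 4 x 4 maxpooling with a 2-pixel stride on the spatial resolution can be checked with a short calculation. The sketch below assumes unpadded pooling, convolutions that preserve the spatial size, and a hypothetical input width of 224 pixels; none of these are stated in the text, so the resulting numbers are illustrative only.

```python
def pooled_size(n, kernel=4, stride=2):
    """Output length of unpadded max pooling along one spatial axis."""
    if n < kernel:
        raise ValueError("input is smaller than the pooling window")
    return (n - kernel) // stride + 1


size = 224  # hypothetical input width, not given in the text
sizes = []
for _ in range(3):  # one pooling stage after each of the three conv layers
    size = pooled_size(size)
    sizes.append(size)
print(sizes)  # [111, 54, 26]
```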
The size of the mini-batch was 8 for competitive learning and 64 for the subsequent fine-tuning. The data augmentation comprised horizontal flips and random crops from five positions (i.e., center, upper right, upper left, lower right, and lower left). The numbers of iterations for the pre-training using competitive learning and the fine-tuning using gradient-based learning were 150,000 and 60,000, respectively. All learning processes employed a conventional stochastic gradient descent (SGD) method for the weight update using a learning coefficient of 0.01, and averaged cross-entropy was employed for the loss function. The learning rate of the fine-tuning was reduced to one tenth every 20,000 iterations. I employed LeCun normal [32] for weight initialization.
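The fine-tuning schedule and the five-position crop described above can be sketched as follows. This is a minimal illustration rather than the original training code: the function names `fine_tune_lr` and `five_position_crop`, the (H, W, C) image layout, and the image and crop sizes in the example are assumptions.

```python
import numpy as np


def fine_tune_lr(iteration, base_lr=0.01, drop_every=20_000, factor=0.1):
    """Step decay: multiply the rate by `factor` every `drop_every` iterations."""
    return base_lr * factor ** (iteration // drop_every)


def five_position_crop(image, crop, position, flip=False):
    """Crop an (H, W, C) image at one of five fixed positions.

    position: 'center', 'upper_left', 'upper_right',
              'lower_left', or 'lower_right'
    """
    h, w = image.shape[:2]
    offsets = {
        "center": ((h - crop) // 2, (w - crop) // 2),
        "upper_left": (0, 0),
        "upper_right": (0, w - crop),
        "lower_left": (h - crop, 0),
        "lower_right": (h - crop, w - crop),
    }
    top, left = offsets[position]
    patch = image[top:top + crop, left:left + crop]
    # A horizontal flip reverses the width axis.
    return patch[:, ::-1] if flip else patch


image = np.arange(6 * 6 * 3).reshape(6, 6, 3)
patch = five_position_crop(image, crop=4, position="lower_right", flip=True)
print(patch.shape)  # (4, 4, 3)
print([fine_tune_lr(i) for i in (0, 20_000, 40_000)])
```

With 60,000 fine-tuning iterations, this schedule yields three stages with rates of roughly 0.01, 0.001, and 0.0001.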
All code was implemented using Python and the Chainer deep learning framework (v.4.5.0) [33] with GPU support. All experiments were performed on an NVIDIA DGX-1 with Tesla P100, and CUDA (v.9.0) and cuDNN (v.7.1.4) libraries were used. My code is available at https://github.com/t-shinozaki/convcp/.
Tab.1 shows the results of the experiment. The proposed method outperformed previous biologically-motivated methods, and achieved a state-of-the-art performance for both top-1 and top-5 test error categories.
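For reference, top-1 and top-5 test errors are conventionally computed as the fraction of samples whose true label is not among the k highest-scoring classes. The NumPy sketch below shows this standard definition on a toy example with three classes; the function name `top_k_error` is hypothetical and the scores are made up for illustration.

```python
import numpy as np


def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k top scores.

    scores : (n_samples, n_classes) class scores
    labels : (n_samples,) integer class indices
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k largest
    hit = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()


scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
err1 = top_k_error(scores, labels, k=1)  # two of three samples are wrong
err2 = top_k_error(scores, labels, k=2)  # one of three samples is wrong
print(err1, err2)
```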
In this study, I proposed a novel unsupervised pre-training method for CNNs using multi-layer competitive learning with a WTA activation function and conscience factor. In the experiment, the competitive learning algorithm obtained high-level features from unlabeled data and achieved state-of-the-art results as a biologically-motivated method in the ImageNet experiment.
With the proposed method, a larger network was needed to realize learning at the same level as a conventional method. It is generally considered that the features obtained by conventional BP learning are selected and reduced according to the target task, whereas those obtained by competitive learning are all-inclusive over total input signals. The ability to utilize more feature quantities is expected to increase the resistance to adversarial attacks [34, 35]; however, this resistance is accompanied by an increase in the network size. Although these characteristics could provide flexibility for several tasks, pruning might be necessary for the efficient inference of a specific task.
An advantage of the proposed method is that it can acquire more diverse learning representations compared with conventional BP learning. Moreover, this diverse learning representation provides strong adaptability to various types of data, which implies that the proposed method is suitable for increasing the number of filters. By contrast, the representation learning of conventional CNNs is relatively weak particularly in their early layers, and they experience difficulty in increasing the number of filters near the input side because of the degradation of BP signals. As a result, BP learning sometimes requires repetitive structures with or without residual connections and acquires the same spatial level filter set across multiple layers. This might be one of the reasons why recent DNNs possess extremely deep structures. The proposed method could enlarge the number of filters in CNNs and enable broad and effective information processing in each layer.
Finally, because the implementation of the proposed method enables seamless switching between unsupervised learning using competitive learning and supervised learning using gradient-based learning, the method could also be useful when the two learning methods are mixed, i.e., for semi-supervised learning. Furthermore, the proposed method is also expected to be applicable to few-shot learning scenarios because it can effectively utilize unlabeled data. However, further studies are required.
This work was supported by JST ERATO Grant Number JPMJER1801, Japan.
[1] Krizhevsky, A., Sutskever, I., and Hinton, G.E. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1106-1114, 2012.
[2] Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533-536, 1986.
[3] Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G.E., and Lillicrap, T. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. Advances in Neural Information Processing Systems, 31:9368-9378, 2018.
[4] LeCun, Y. Learning process in an asymmetric threshold network. Disordered systems and biological organization, Springer, 1986.
[5] Hinton, G.E. How to do backpropagation in a brain. Deep Learning Workshop in NIPS, 2007.
[6] Bengio, Y. How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906, 2014.
[7] Lee, D.H., Zhang, S., Fischer, A., and Bengio, Y. Difference target propagation. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 498-515, 2015.
[8] Lillicrap, T.P., Cownden, D., Tweed, D.B., and Akerman, C.J. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7:13276, 2016.
[9] Samadi, A., Lillicrap, T.P., and Tweed, D.B. Deep learning with dynamic spiking neurons and fixed feedback weights. Neural computation, 29:578-602, 2017.
[10] Olshausen, B.A. and Field, D.J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607-609, 1996.
[11] Hyvärinen, A. and Hoyer, P.O. A two layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41:2413-2423, 2001.
[12] Le, Q.V., Ranzato, M.A., Monga, R., Devin, M., Chen, K., Corrado, G.S., Dean, J. and Ng, A.Y. Building high-level features using large scale unsupervised learning. Proc. 29th Int. Conf. on Machine Learning, 2012.
[13] Hinton, G.E. Training Products of Experts by Minimizing Contrastive Divergence. Neural Computation, 14:1771-1800, 2002.
[14] Hinton, G.E., and Salakhutdinov, R. Reducing the dimensionality of data with neural networks. Science, 313:504-507, 2006.
[15] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. Advances in Neural Information Processing Systems, 19:153-160, 2007.
[16] Radford, A. and Metz, L. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, arXiv:1511.06434, 2016.
[17] Goroshin, R., Bruna, J., Tompson, J., Eigen, D., and LeCun, Y. Unsupervised Learning of Spatiotemporally Coherent Metrics. arXiv preprint arXiv:1412.6056, 2015.
[18] Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern., 43:59-69, 1982.
[19] Fukushima, K. Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36:193-202, 1980.
[20] Shinozaki, T. Biologically inspired feedforward supervised learning for deep self-organizing map networks. MLINI Workshop in NIPS, arXiv:1710.09574, 2016.
[21] Shinozaki, T. Competitive Learning Enriches Learning Representation and Accelerates the Fine-tuning of CNNs. DLTP Workshop in NIPS, arXiv:1804.09859, 2017.
[22] Krotov, D. and Hopfield, J.J. Unsupervised learning by competing hidden units. Proceedings of the National Academy of Sciences, 201820458, 2019.
[23] Makhzani, A. and Frey, B. A winner-take-all method for training sparse convolutional autoencoders. Deep Learning Workshop in NIPS, arXiv:1409.2752, 2014.
[24] Srivastava, R.K., Masci, J., Kazerounian, S., Gomez, F., and Schmidhuber, J. Compete to compute. Advances in Neural Information Processing Systems, 26:2310-2318, 2013.
[25] Rumelhart, D.E. and Zipser, D. Feature discovery by competitive learning. Cognitive Science, 9:75-112, 1985.
[26] Grossberg, S. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11:23-63, 1987.
[27] DeSieno, D. Adding a conscience to competitive learning. IEEE Int. Conf. on Neural Networks, 1(6) 117-124, 1988.
[28] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., and Berg, A.C. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[29] LeCun, Y., Cortes, C., and Burges, C.J.C. The MNIST database of handwritten digits, 1998.
[30] Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images, 2009.
[31] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., and Rabinovich, A. Going deeper with convolutions. Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2015.
[32] LeCun, Y., Bottou, L., Orr, G., and Muller, K.R. Efficient backprop. Neural Networks: Tricks of the Trade. New York: Springer, 1998.
[33] Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: a Next-Generation Open Source Framework for Deep Learning. LearningSys Workshop in NIPS, 2015.
[34] Goodfellow, I.J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ICLR, arXiv:1412.6572, 2015.
[35] Athalye, A., Engstrom, L., Ilyas, A., and Kwok, K. Synthesizing robust adversarial examples. ICML, 80:284-293, 2018.
[36] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard R.E., and Jackel, L.D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.
Figure A.1: Network structures for (a) MNIST, (b) CIFAR, and (c) ImageNet experiments.
I performed image discrimination tasks with the MNIST [29] and CIFAR-10 [30] datasets in addition to the ImageNet task. A LeNet5-like neural network [36] was employed as the baseline, and the proposed hierarchical competitive learning was applied to it. The network consisted of two or three convolutional layers and one fully-connected layer. Each convolutional layer was followed by an activation function, WTA (for the test) or ReLU (for the baseline), and max-pooling. The detailed structure of the network for each task is shown in Fig.A.1. I compared the test errors of the image discrimination results.
B.1 MNIST
I employed a LeNet5-like network for the test networks using only two convolutional layers. Fig.A.1(a) shows the detailed structure of the network. The baseline network consisted of two convolutional and two fully-connected layers. The convolutional layers consisted of 25 and 50 filters with 5 x 5 pixels and were accompanied by ReLU and 2 x 2 maxpooling. The fully-connected layers consisted of 100 and 10 units.
Both pre-training and fine-tuning employed the training dataset with 50,000 samples, and the validation process used the test dataset with 10,000 samples. The size of a mini-batch was 100, and the averaged cross-entropy was employed for the loss function. All learning processes used a conventional SGD method for the weight update, and the learning coefficient was 0.01. I did not employ weight decay or momentum. The numbers of iterations for the pre-training using competitive learning and the fine-tuning using gradient-based learning were 15,000 and 3,000, respectively. It should be noted that image augmentation was not employed.
Fig.B.1 shows the obtained filters in the first and second convolutional layers. The color of the filter was obtained by dividing the entire filter set into three bins corresponding to red, green, and blue respectively.
The filters in Fig.B.1(a,b) were learned using only conventional BP learning, and look like noise patterns. By contrast, the filter set learned through competitive learning obtained clear spatial structures (Fig.B.1(c,d)). The results show that competitive learning performs stronger representation learning than BP learning.
Fig.B.2 shows the transition of test errors during the fine-tuning process. The proposed method converged much faster than the baseline and achieved almost the same accuracy. Tab.B.1 shows the resultant test error rates for respective methods. Because image augmentation was not employed, the result of the baseline was not outstanding. However, the result from the proposed method is comparable to other biologically motivated learning methods.
Figure B.1: Obtained filters in MNIST experiments of (a) first and (b) second convolutional layers learned by conventional BP learning. (c,d) those by competitive learning.
Figure B.2: Transitions of test errors during fine-tuning in MNIST experiments.
Table B.1: Test errors on MNIST.
B.2 CIFAR
I employed a LeNet5-like neural network with two convolutional layers. The network was almost identical to that of MNIST, but with a larger number of filters. Fig.A.1(b) shows the detailed structure of the network. The baseline network consisted of three convolutional and two fully-connected layers. The convolutional layers consisted of 32, 32, and 64 filters with 5 x 5 pixels and were accompanied by ReLU and 2 x 2 maxpooling. The fully-connected layers comprised 4096 and 10 units.
Both pre-training and fine-tuning used the training dataset with 50,000 samples, and the validation process employed the test dataset with 10,000 samples. The size of a mini-batch was 100, and the averaged cross-entropy was employed for the loss function.
All learning processes employed the normal SGD method for the weight update, and the learning coefficient was 0.01. I did not use either weight decay or momentum. The numbers of iterations for the pre-training using competitive learning and the fine-tuning using gradient-based learning were 75,000 and 15,000, respectively. Image augmentation was not employed.
Fig.B.3 shows the obtained filter sets in conv1 (b) and conv2 (c) learned through competitive learning. Although BP learning could not achieve a clear spatial structure in the filter sets (Fig.B.3(a)), the proposed method obtained a basis with a clear spatial structure for natural images. However, the filters obtained by competitive learning were dominated by low spatial frequencies because the dynamics were strongly dependent on their frequency of occurrence. Hence, an Inception-like [31] structure was introduced to acquire features with higher spatial frequencies using small filters.
Fig.B.3(d) shows the filter set with the second convolutional layer learned through competitive learning without the conscience factor (CF) [27]. Only a few filters could obtain a clear spatial structure, while the others were obtained as noise patterns. As this tendency was more pronounced in higher (output side) layers, applying CF is essential when using competitive learning for DNNs.
Fig.B.4 shows the transitions of the test errors during fine-tuning. Competitive learning drastically accelerated the learning speed during the initial iterations (before 2000 iterations), and the test errors were comparable with the baseline.
Tab.B.2 shows the comparison of the final test errors among several methods. The proposed method achieved a comparable score with respect to biologically motivated learning and BP.
Figure B.3: Obtained filters in CIFAR experiments. Filters in the first layer with 5x5 pixel size learned by (a) conventional back propagation learning or (b) competitive learning; (c) Filters in the second layer learned by competitive learning employing the conscience factor; (d) (c) without the conscience factor.
Figure B.4: Transitions of test errors during fine-tuning in CIFAR experiments.
Table B.2: Test errors on CIFAR.