A Deep Neuro-Fuzzy Network for Image Classification

2019·Arxiv

Abstract

The combination of neural networks and fuzzy systems into neuro-fuzzy systems integrates fuzzy reasoning rules into connectionist networks. However, existing neuro-fuzzy systems are developed under shallow structures with lower generalization capacity. We propose the first end-to-end deep neuro-fuzzy network and investigate its application to image classification. Two new operations, the fuzzy inference operation and the fuzzy pooling operation, are developed based on the definitions of the Takagi-Sugeno-Kang (TSK) fuzzy model; stacks of these operations comprise the layers of this network. We evaluate the network on the MNIST, CIFAR-10 and CIFAR-100 datasets, finding that the network has reasonable accuracy on these benchmarks.

1 Introduction

Performance on many real-world problems has been significantly improved by replacing shallow structures with deeper networks. Problems such as image classification [1], object detection [2], semantic segmentation [3], and sequence modeling [4] have had their breakthroughs in recent years through deep neural networks. Basically, deep neural networks are stacks of multiple hidden layers or classifiers that extract complex features from inputs; the networks integrate low-, mid-, and high-level features from different layers to improve generalization and performance [5].

Neuro-fuzzy systems are models based on a combination of neural networks and fuzzy systems. In general, the design of the fuzzy system in these models follows the learning procedures of neural networks [6]. Neuro-fuzzy networks are usually applied to function approximation problems such as classification tasks and control systems [7]. Various neuro-fuzzy structures have been proposed [6]; the most popular is a 5-layer feed-forward network called the adaptive neuro-fuzzy inference system (ANFIS), in which the parameters of a TSK fuzzy model are calculated within a neural network framework [8]. The proposed structures for neuro-fuzzy models are shallow, as they only model one rule set of a fuzzy system. As with other learning structures, introducing deeper structures in neuro-fuzzy models may improve their performance.

Deep structures have been introduced in neuro-fuzzy frameworks in recent years by chaining stacks of TSK systems [9–12]. The outputs of each level in these models remain in the same space as the original input space throughout the network, using random shift [12] and feature augmentation [10, 11]; the models have been applied to the classification of feature-based datasets. For image classification, several works have combined deep neural networks and fuzzy systems [13–15]; in these models, a fuzzy clustering or a fuzzy rule-based system is applied to the features extracted by a well-known deep neural network. However, none of these works has explored a deep neuro-fuzzy network as an end-to-end network for image classification.

In this paper, we propose an end-to-end deep neuro-fuzzy network for image classification. Layers in this network are stacks of two new operations developed from TSK system concepts; we call them the fuzzy inference operation and the fuzzy pooling operation. To the best of our knowledge, the proposed network is the first end-to-end deep neuro-fuzzy structure for image classification. Experiments on MNIST [16], CIFAR-10 and CIFAR-100 [17] demonstrate that the model performs comparably to other deep network structures on image classification tasks.

The remainder of the paper is organized as follows. In Section 2, we provide background on neuro-fuzzy systems and review the latest works on deep neuro-fuzzy structures. Our proposed network is presented in Section 3, and our experimental results are in Section 4. We close with a summary and discussion of future work in Section 5.

2 Background Review

2.1 Fuzzy Logic and Neuro-Fuzzy Structures

A fuzzy set is defined as a membership function mapping elements of a universe of discourse (X) to the unit interval [0,1] [18].

For data analysis applications, a membership grade can be viewed as a degree of similarity, preference, or uncertainty. From the similarity perspective, a membership grade represents the degree of compatibility of an element in the universe of discourse with the representative elements of a fuzzy set A [19].
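As a minimal illustration of a membership grade read as similarity, the sketch below assumes a Gaussian membership function centred on a representative element. The Gaussian form, the centre `c` and the width `sigma` are illustrative choices and are not taken from the paper.

```python
import math


def gaussian_membership(x, c, sigma):
    """Membership grade of x in a fuzzy set whose representative element is c.

    The Gaussian shape is an assumption; any function mapping into [0, 1] would do.
    """
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


# Elements closer to the representative element c = 5 receive higher grades.
grades = [gaussian_membership(x, c=5.0, sigma=2.0) for x in (5.0, 7.0, 11.0)]
```

The grade equals 1 at the representative element and decays towards 0 as the element moves away from it.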

Fuzzy if-then rules are expressions of the form IF A THEN B, where A is a fuzzy set and B is either a fuzzy set or a function of the inputs. Fuzzy rules aim to add a human-level decision-making procedure to a system in order to capture the uncertainty and imprecision of the environment. A fuzzy rule-based system is built from multiple fuzzy rules, each of which works as a local descriptor of the environment [20, 8].

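Neuro-fuzzy systems are architectures that model fuzzy rule-based systems. ANFIS, proposed in 1993, is one of the first and most popular neuro-fuzzy architectures [8]; it is a layered feed-forward network based on the TSK inference system. A hybrid of gradient descent and least-squares estimation is used to learn the network parameters. A TSK rule set is defined as [8]:

$$R^k:\ \text{IF } x_1 \text{ is } A_1^k \text{ AND } \dots \text{ AND } x_m \text{ is } A_m^k \text{ THEN } y^k = f^k(x_1, \dots, x_m)$$

where k = 1, 2, ..., K and K is the number of fuzzy rules, $x_i$ is the ith input variable, $A_i^k$ is a fuzzy set for the ith input of the kth rule, AND is a fuzzy conjunction operator, and $y^k$ is the output of the kth rule. The output of the system is calculated as

$$y = \sum_{k=1}^{K} \bar{w}_k\, y^k$$

where $\bar{w}_k = w_k / \sum_{j=1}^{K} w_j$, $w_k = \mathrm{AND}_{i=1}^{m}\, \mu_{A_i^k}(x_i)$, and $\mu_{A_i^k}(x_i)$ is a membership grade measuring the degree of similarity between $x_i$ and $A_i^k$. Each layer in the ANFIS network is described as follows [8]:

• Layer 1: The membership grade of an input is calculated as: $\mu_{A_i^k}(x_i)$

• Layer 2: The firing strength of the kth rule is obtained as: $w_k = \mathrm{AND}_{i=1}^{m}\, \mu_{A_i^k}(x_i)$

• Layer 3: The firing strength of each rule is normalized as: $\bar{w}_k = w_k / \sum_{j=1}^{K} w_j$

• Layer 4: The output of each rule is calculated as: $\bar{w}_k\, y^k = \bar{w}_k\, f^k(x_1, \dots, x_m)$

• Layer 5: The final output of the network is obtained as: $y = \sum_{k=1}^{K} \bar{w}_k\, y^k$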
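The five layers can be traced in a few lines of code. The sketch below is a hedged illustration rather than the ANFIS implementation of [8]: it assumes Gaussian membership functions, the product as the conjunction operator, and linear (first-order) rule outputs. The rule parameters and the helper names `gauss` and `tsk_forward` are hypothetical.

```python
import math


def gauss(x, c, sigma):
    """Gaussian membership grade (an assumed membership function shape)."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def tsk_forward(x, rules):
    """Forward pass through the five layers for one input vector x."""
    # Layers 1 and 2: membership grades, combined by product into firing strengths.
    strengths = []
    for rule in rules:
        w = 1.0
        for x_i, (c, sigma) in zip(x, rule["sets"]):
            w *= gauss(x_i, c, sigma)
        strengths.append(w)
    # Layer 3: normalize the firing strengths so that they sum to 1.
    total = sum(strengths)
    normalized = [w / total for w in strengths]
    # Layer 4: weight each rule's (assumed linear) output by its normalized strength.
    weighted = []
    for w_bar, rule in zip(normalized, rules):
        y_k = rule["bias"] + sum(p * x_i for p, x_i in zip(rule["coef"], x))
        weighted.append(w_bar * y_k)
    # Layer 5: sum the weighted rule outputs.
    return sum(weighted), normalized


# Two hypothetical rules over two inputs.
rules = [
    {"sets": [(0.0, 1.0), (0.0, 1.0)], "coef": [1.0, 1.0], "bias": 0.0},
    {"sets": [(2.0, 1.0), (2.0, 1.0)], "coef": [0.0, 0.0], "bias": 10.0},
]
y, w_bar = tsk_forward([1.0, 1.0], rules)
```

The input (1, 1) lies midway between the two rule centres, so both rules fire equally and the output is the mean of the two rule outputs, 2 and 10.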

2.2 Deep Neuro-Fuzzy Structures

Only a small number of works integrate fuzzy logic and deep learning algorithms. Aviles et al. [21] combined ANFIS and a Long Short-Term Memory (LSTM) structure to estimate the interaction forces in robot-assisted minimally invasive scenarios. [22] proposed a fuzzy restricted Boltzmann machine (FRBM) in which the model parameters are fuzzy numbers. [23] extended FRBM with Pythagorean fuzzy numbers [24] and applied the model to airline passenger profiling, and [25] extended FRBM with interval type-2 fuzzy numbers [26]. [27] used Pythagorean fuzzy values to express the distribution of parameters in a deep denoising auto-encoder and applied it to early warning of industrial accidents. [9–12] developed deep fuzzy structures using stacks of TSK fuzzy systems. [9] considered each node in a layer as an independent TSK model. [12] fed a random shift of the outputs of the previous fuzzy system, along with the original input, as input to the next fuzzy system. [10, 11] augmented the original input with the output of the previous fuzzy system and fed it as input to the next fuzzy system. [28] proposed an evolving deep neuro-fuzzy structure for studying dynamic data streams.

Few studies have considered the integration of fuzzy systems and deep learning for image classification [13–15]. In [13], features were first extracted from images using a VGGNet [29], and then a set of fuzzy rule-based layers [20] was applied to classify the images. [14] classified images by applying fuzzy c-means clustering to the features extracted from a CNN. [15] used a combination of features extracted from a CNN and fuzzy rough c-means clustering for semi-supervised image classification.

3 Proposed Structure

To develop an end-to-end deep neuro-fuzzy network for image classification, we introduce two operations, namely the fuzzy inference operation and the fuzzy pooling operation. Stacks of these operations comprise the network. In this section, we describe the design of these operations and of our deep neuro-fuzzy network.

3.1 Fuzzy Inference Operation

A fuzzy inference operation models a TSK rule-based system for image analysis. To develop this operation, we work with single-input-multi-output (SIMO) rule-based systems of the form

$$R^k:\ \text{IF } x \text{ is } A^k \text{ THEN } y_1^k = f_1^k(x),\ \dots,\ y_n^k = f_n^k(x), \qquad k = 1, \dots, K \qquad (4)$$

where x is the input variable from the universe of discourse (X), K is the number of rules, n is the number of outputs, $A^k$ is a fuzzy set for the kth rule defined over X, and $y_i^k$ is the ith output of the kth rule; the outputs are nonlinear functions of the inputs. To employ this set of rules for image analysis, we follow the fuzzy rule-based method proposed in [20] where premise of the rules can present statements such as “

For image analysis, we consider the universe of discourse as a set of subregions of a given image. Also, we consider a fuzzy set to be a specific pattern. Finally, we consider the membership grade to be the similarity between the subregions and the given pattern. Each fuzzy rule in the proposed rule set captures a different pattern in the image. Semantically, we can rewrite Equation 4 as:

Thus, the proposed model follows a SIMO rule-based system in which the single input is the given image, the kth rule captures a pattern in the subregions of the image, each output is a nonlinear function over the subregions for the kth pattern, and the defuzzified outputs are determined by the combination of the patterns that the fuzzy sets of the rules specify on the image.

A fuzzy inference operation includes the following three steps.

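1. Calculating membership matrix: For each rule, a matrix of membership grades is calculated, where each item in the matrix shows the similarity between a subregion of the image and the fuzzy set of the rule. We employ the dot product for similarity measurement, and the membership values are restricted to [0, 1].

2. Calculating firing strength: The firing strength of each rule is obtained by normalizing each item of its membership matrix by the corresponding items of the membership matrices of all K rules (Equation 7), where K is the number of rules.

3. Calculating final outputs: The final outputs of the rule-based system are obtained from the firing strengths and the rule outputs (Equation 8).

Fuzzy sets (patterns) and the parameters of the output function are learned using gradient descent.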

3.1.1 Example 1 - Fuzzy Inference Operation for a 4 × 4 image

The steps of a fuzzy inference operation are described in detail in this example. The input is a one-channel 4 × 4 image (a), and the fuzzy set of the corresponding fuzzy rule is a pattern w.

Calculating membership matrix

First, the universe of discourse is created by dividing the image into subregions of the same size as the fuzzy sets; we consider subregions with a stride of 1. For each rule, a membership matrix is obtained by calculating the similarity between each subregion and the pattern specified by its fuzzy set using the dot product: the ith item in the membership matrix of the kth rule is the dot product of the ith subregion and the pattern of the kth rule, and shows the similarity between them. Figure 1 shows the membership grade assignment for the image.

Figure 1: Membership matrix calculation

Calculating the membership matrix for each rule is similar to calculating a convolution between an image and a filter; in this example, the operation resembles a convolution of the 4 × 4 image with a filter of the same size as the fuzzy set and a stride of 1. The main difference is that the values in the membership matrix are in the range [0,1].

To implement the membership matrix in the proposed fuzzy inference operation, we use a convolution operation in which the number of filters equals the number of rules and the filter size equals the size of the patterns specified by the fuzzy sets. As membership grades lie in [0,1], the output of the operation is clipped to [0,1].
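The membership matrix calculation described above can be sketched as a sliding dot product followed by clipping. The pixel values and the 2 × 2 pattern below are hypothetical, since the pattern used in the paper's example is not legible in this copy. The helper name `membership_matrix` is also an assumption.

```python
def membership_matrix(image, pattern, stride=1):
    """Dot product of every subregion with the pattern, clipped to [0, 1]."""
    ph, pw = len(pattern), len(pattern[0])
    rows = (len(image) - ph) // stride + 1
    cols = (len(image[0]) - pw) // stride + 1
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            dot = sum(
                image[r * stride + i][c * stride + j] * pattern[i][j]
                for i in range(ph)
                for j in range(pw)
            )
            # Membership grades must lie in [0, 1], so the raw similarity is clipped.
            row.append(min(1.0, max(0.0, dot)))
        result.append(row)
    return result


# Hypothetical 4 x 4 one-channel image containing a bright vertical bar.
image = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
]
# Hypothetical 2 x 2 pattern responding to a bright-to-dark vertical edge.
pattern = [[0.25, -0.25], [0.25, -0.25]]

mu = membership_matrix(image, pattern)
```

Only the subregions on the right edge of the bar match the pattern; negative similarities elsewhere are clipped to 0, so `mu` is non-zero only in its last column.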

Calculating firing strength

To calculate the firing strength of each rule, each item in the membership matrix of the kth rule is normalized by the sum of its corresponding items across the membership matrices of all rules; the result is the ith item in the firing strength of the kth rule.
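A minimal sketch of this element-wise normalization across rules is given below. The small constant `eps`, which guards against division by zero when no rule fires at a position, is an added assumption and is not described in the paper. The membership values are made up for illustration.

```python
def firing_strengths(memberships, eps=1e-8):
    """Normalize each item by the sum of the corresponding items over all rules."""
    rows, cols = len(memberships[0]), len(memberships[0][0])
    totals = [
        [sum(m[r][c] for m in memberships) for c in range(cols)]
        for r in range(rows)
    ]
    return [
        [[m[r][c] / (totals[r][c] + eps) for c in range(cols)] for r in range(rows)]
        for m in memberships
    ]


# Hypothetical membership matrices of two rules over the same four subregions.
mu_1 = [[0.2, 0.0], [0.5, 1.0]]
mu_2 = [[0.6, 0.0], [0.5, 0.0]]
w_1, w_2 = firing_strengths([mu_1, mu_2])
```

At every position where at least one rule fires, the firing strengths of the two rules sum to approximately 1.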

Calculating final output

The following shows the calculation for the ith of the n outputs. The output of each rule is calculated from its firing strength and the image; each item in the firing strength is multiplied only with its corresponding subregion. Figure 2 shows the relation between subregions and the firing strength. Thus, the output is obtained as:

Figure 2: Relation between subregions and the firing strength

In Equation 10, g(·) is a function defined to deal with overlapping subregions. The output can be considered as an element-wise multiplication between the image and the output of g(·); note that in this example, to simplify visualization, we consider f(x) to be an identity function on the subregions. Figure 3 shows this process.

Figure 3: Element-wise multiplication between the image and the output of g(·)

Computing g(·) resembles a convolution with stride 1 between a filter and a padded version of the firing strength matrix, which is padded in height and width by the difference between the size of the image and the size of the membership matrix. Figure 4 demonstrates this calculation.

Therefore, the output of each fuzzy rule is calculated using the following three steps: 1. Pad the firing strength matrix. 2. Convolve the padded matrix with a filter of the same size as the fuzzy sets. 3. Do an element-wise multiplication between the image and the result of step 2. To obtain the final output, the rule outputs are summed over the rules.

Figure 4: Calculation of g(·) using a convolution operation
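The pad, convolve and multiply steps can be sketched as follows. This is an illustration under several assumptions: inputs are square, f(x) is the identity as in the example, a fixed averaging filter stands in for the learned filter, and the rule outputs are combined by summation. The helper names and all numeric values are hypothetical.

```python
def pad(matrix, p):
    """Zero-pad a 2-D list by p items on every side."""
    width = len(matrix[0]) + 2 * p
    top = [[0.0] * width for _ in range(p)]
    bottom = [[0.0] * width for _ in range(p)]
    body = [[0.0] * p + list(row) + [0.0] * p for row in matrix]
    return top + body + bottom


def convolve(matrix, kernel):
    """Stride-1 sliding-window sum of products (no kernel flip)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [
        [
            sum(matrix[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def rule_output(image, strength, kernel):
    """Steps 1-3 for one rule: pad, convolve, multiply with the image."""
    p = len(image) - len(strength)  # size difference (square inputs assumed)
    spread = convolve(pad(strength, p), kernel)
    return [[image[r][c] * spread[r][c] for c in range(len(image[0]))]
            for r in range(len(image))]


def inference_output(image, strengths, kernel):
    """Sum the rule outputs over all rules."""
    per_rule = [rule_output(image, s, kernel) for s in strengths]
    return [[sum(o[r][c] for o in per_rule) for c in range(len(image[0]))]
            for r in range(len(image))]


# Hypothetical 3 x 3 image and firing strengths of two rules (2 x 2 subregions).
image = [[1.0] * 3 for _ in range(3)]
strengths = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]]
averaging = [[0.25, 0.25], [0.25, 0.25]]  # stands in for the learned filter
out = inference_output(image, strengths, averaging)
```

Padding by the size difference restores the image size after the convolution, so every pixel collects the firing strengths of exactly those subregions that cover it: one at the corners, two on the edges and four in the centre.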

Table 1: Steps for a fuzzy inference operation

1. Obtain the membership matrix of each rule by convolving a filter with the image.

2. Clip the values in the membership matrix to the range [0,1].

3. Obtain the firing strength of each rule by normalizing its membership matrix using Equation 7.

4. Concatenate the membership matrices along the depth dimension into a single matrix.

5. Pad this matrix based on the size difference between the image and the membership matrix.

6. To implement g(.), first convolve a 1*1 filter with the padded matrix to capture the relation between corresponding items across its depth. Then convolve the obtained feature maps with filters of the same size as the patterns. To apply this step to all n outputs simultaneously, the number of filters in this step is set to the number of outputs, n.

7. Convolve the image with filters of the same size as the fuzzy sets to implement f(x). This operation also helps us to work with multi-channel inputs. The number of filters is equal to the number of outputs.

8. Do an element-wise multiplication between the results of steps 6 and 7.

We can implement the proposed fuzzy inference operation efficiently in machine learning frameworks such as TensorFlow [30]. To do so, Equation 8 is modified. Table 1 shows the steps to calculate a fuzzy inference operation for a multi-output TSK rule-based system, and Figure 5 shows a diagram of the fuzzy inference operation.

Figure 5: Fuzzy inference operation for a TSK rule-based system with multiple outputs

3.2 Fuzzy Pooling Operation

In the fuzzy inference operation, the input and output have the same size because the number of subregions defined for an image is equal to the number of pixels in the image. If we consider fewer subregions, we can reduce the size of the outputs. For example, if we define only 4 subregions for an image, the outputs are reduced to 4 items; in other words, the size of the outputs equals the number of subregions. The implementation of the fuzzy pooling operation is similar to that of fuzzy inference. The main difference is that there is no need for step 5 and part of step 6 in Table 1. Table 2 shows the steps for the fuzzy pooling operation.

Table 2: Steps for fuzzy pooling operation
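The effect of defining fewer subregions can be sketched by enumerating subregion positions. The sizes below (a 4 × 4 image, 2 × 2 fuzzy sets and a stride of 2) are illustrative assumptions, and `subregion_origins` is a hypothetical helper.

```python
def subregion_origins(image_size, pattern_size, stride):
    """Top-left corners of the subregions of a square image."""
    steps = range(0, image_size - pattern_size + 1, stride)
    return [(r, c) for r in steps for c in steps]


# With a stride of 2, a 4 x 4 image is covered by 4 non-overlapping subregions.
origins = subregion_origins(image_size=4, pattern_size=2, stride=2)
output_side = int(len(origins) ** 0.5)
```

Each subregion contributes one item to the membership matrix, so the 4 subregions here give a 2 × 2 output.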

3.3 Deep Neuro-Fuzzy Network

To build a deep neuro-fuzzy network, we chain stacks of fuzzy inference and fuzzy pooling operations. The input to an operation is the output of the previous operation. For example, the outputs of an operation are concatenated into a matrix and passed as input to the next operation.

The proposed network is able to extract local and global features for image analysis. For image classification, the extracted features are passed to fully connected layers. The network is trained end-to-end using gradient descent. The filter parameters of the fuzzy inference and fuzzy pooling operations are learned along with the parameters of the fully connected layers. A detailed description of the learnable parameters of the fuzzy inference operation, based on the steps of Table 1, is as follows:

In step 1, the filters related to the fuzzy sets are learnable; steps 2 to 5 have no parameters. In step 6, the filters have learnable parameters and we use Leaky ReLU [31] as the nonlinearity; for the second part of step 6, we use average pooling instead of a convolution operation. In step 7, the filters have learnable parameters and Leaky ReLU is used as the nonlinearity. Step 8 has no parameters.

4 Experiments

In this section, we apply deep neuro-fuzzy networks to the classification of the MNIST [16], CIFAR-10 and CIFAR-100 [17] datasets.

MNIST [16] consists of handwritten digits from 0 to 9. The dataset contains 60k training and 10k testing images of size 28 × 28 in 10 classes. The last 10k training images are used as the validation set. The network designed for MNIST is shown in Figure 6.

Figure 6: Network designed for MNIST dataset

ReLU is used as the nonlinearity in the fully connected layers, and a softmax layer at the end of the network computes the probabilities of the predicted classes. Cross-entropy is used as the loss function. To train the network, we use a batch size of 512 on a single GPU with the Adam optimizer [32]. The learning rate is divided by 10 after the 100th and 300th epochs; we also reduce it gradually over epochs by a factor of 0.9995. Within an epoch, the learning rate is additionally reduced over batches by a factor of 0.9995 for the first 100 epochs and by 0.99995 for the rest of training. The best model is selected based on the validation error. Images are normalized to the range [0, 1]; for training images, we apply a simple data augmentation by randomly shifting images horizontally and vertically (10%). Table 3 shows the model performance on the MNIST dataset.
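One possible reading of this learning rate schedule is sketched below. The initial value is not legible in this copy, so `initial_lr` is a placeholder. Restarting the per-batch decay at the beginning of every epoch is likewise an assumption, because the text does not say how the per-epoch and per-batch factors interact.

```python
def learning_rate(epoch, batch, initial_lr=1e-3):
    """Learning rate for a zero-based epoch and a zero-based batch within it.

    initial_lr is a placeholder value, not the one used in the paper.
    """
    lr = initial_lr
    # Step decay: divide by 10 after the 100th and again after the 300th epoch.
    if epoch >= 100:
        lr /= 10.0
    if epoch >= 300:
        lr /= 10.0
    # Gradual decay over epochs.
    lr *= 0.9995 ** epoch
    # Decay over batches, assumed to restart at the beginning of each epoch.
    batch_decay = 0.9995 if epoch < 100 else 0.99995
    lr *= batch_decay ** batch
    return lr


start = learning_rate(epoch=0, batch=0)
after_step = learning_rate(epoch=100, batch=0)
```

Under these assumptions the rate at epoch 100 is roughly a tenth of the starting value, further reduced by the accumulated per-epoch decay.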

The CIFAR-10 dataset [17] consists of 50k training images and 10k testing images of size 32 × 32 in 10 classes; the last 10k training images are used as the validation set. The network designed for this dataset has 11 layers: FIO(#rules=64, #outputs=32, fuzzy sets= , stride=1), FIO(#rules=64, #outputs=32, fuzzy sets= , stride=1), FPO(#rules=128, #outputs=64, fuzzy sets= , stride=2), FIO(#rules=128, #outputs=64, fuzzy sets= , stride=1), FIO(#rules=128, #outputs=64, fuzzy sets= , stride=1), FPO(#rules=128, #outputs=64, fuzzy sets= , stride=2), FIO(#rules=256, #outputs=128, fuzzy sets= , stride=1), FPO(#rules=256, #outputs=128, fuzzy sets= , stride=2), FL(#units=512, dropout=0.2), FL(#units=512, dropout=0.2), FL(#units=10), where FIO, FPO and FL stand for fuzzy inference operation, fuzzy pooling operation and fully connected layer, respectively.

The model trained for MNIST is used as a pretrained model for this dataset. Leaky ReLU is used as the nonlinearity in the fully connected layers. The model is trained with the same batch size and optimizer as the MNIST model. The learning rate is divided by 10 when the validation error plateaus. The learning rate is also reduced by a factor of 0.9995 between epochs and by 0.99994 between batches. The best set of parameters for testing is selected by validation error. We apply sample-wise mean subtraction and sample-wise standard deviation normalization to the training and testing images. For training images, we apply a simple data augmentation by randomly flipping images horizontally and randomly shifting images horizontally and vertically (20%). Table 4 shows the results.
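Sample-wise normalization, as commonly understood, subtracts each image's own mean and divides by its own standard deviation. The sketch below works on a flat list of pixel values; the constant `eps` that avoids division by zero for constant images is an added assumption, and the pixel values are made up.

```python
import math


def samplewise_normalize(pixels, eps=1e-7):
    """Center one image on its own mean and scale by its own standard deviation."""
    mean = sum(pixels) / len(pixels)
    centered = [p - mean for p in pixels]
    std = math.sqrt(sum(c * c for c in centered) / len(centered))
    return [c / (std + eps) for c in centered]


# Hypothetical pixel values of a single (flattened) image.
normalized = samplewise_normalize([0.0, 64.0, 128.0, 255.0])
```

After normalization the image has approximately zero mean and unit variance, independently of the statistics of the other images in the dataset.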

CIFAR-100 has the same format as CIFAR-10 but with 100 classes. As for CIFAR-10, we use the last 10k training images as the validation set. The network designed for this dataset has 11 layers: FIO(#rules=64, #outputs=32, fuzzy sets= , stride=1), FIO(#rules=64, #outputs=32, fuzzy sets= , stride=1), FPO(#rules=128, #outputs=32, fuzzy sets= , stride=2), FIO(#rules=128, #outputs=32, fuzzy sets= , stride=1), FIO(#rules=128, #outputs=32, fuzzy sets= , stride=1), FPO(#rules=128, #outputs=64, fuzzy sets= , stride=2), FIO(#rules=128, #outputs=64, fuzzy sets= , stride=1), FPO(#rules=128, #outputs=64, fuzzy sets= , stride=2), FL(#units=512, dropout=0.2), FL(#units=512, dropout=0.2), FL(#units=100). We train the network in the same way as for CIFAR-10. Table 5 shows the comparison results.

Tables 3–5 compare the performance of our proposed method with the current state of the art on each benchmark. We have also compared our model with the models closest to it in terms of accuracy. Table 3 shows our model's performance on MNIST compared to Maxout networks [35] and models based on multi-loss regularization [34]. Table 4 shows the model's performance compared to a recurrent neural network [37] and AlexNet [1]. Table 5 shows our performance compared to models with tree-based priors [38] and the network-in-network model [39].

Our preliminary experiments indicate that the performance of the deep neuro-fuzzy system is comparable to that of other deep network structures, and that it can be a viable structure for image analysis. However, the model does not outperform the state of the art on any of the three datasets, which calls for further research on this topic.

5 Conclusions

We have proposed the first end-to-end deep neuro-fuzzy network for image classification. The network does not push state-of-the-art results, but it shows that deep structures based on fuzzy models are applicable to image analysis. In this paper, we only investigated the design of two new operations based on the TSK model. In the future, further operations based on the TSK model and on other fuzzy models, such as the Mamdani model [19], can be studied for image classification. Moreover, research on regularization methods based on fuzzy models may improve the performance of future models.

References

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.

[2] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.

[3] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, pp. 1520–1528, 2015.

[4] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252, JMLR.org, 2017.

[5] Y. Bengio et al., “Learning deep architectures for ai,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[6] D. D. Nauck and A. Nürnberger, “Neuro-fuzzy systems: A short historical review,” in Computational Intelligence in Intelligent Data Analysis, pp. 91–109, Springer, 2013.

[7] W. Pedrycz, A. Kandel, and Y.-Q. Zhang, “Neurofuzzy systems,” in Fuzzy Systems, pp. 311–380, Springer, 1998.

[8] J.-S. Jang, “Anfis: adaptive-network-based fuzzy inference system,” IEEE transactions on systems, man, and cybernetics, vol. 23, no. 3, pp. 665–685, 1993.

[9] S. Rajurkar and N. K. Verma, “Developing deep fuzzy network with takagi sugeno fuzzy inference system,” in 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6, IEEE, 2017.

[10] Y. Zhang, H. Ishibuchi, and S. Wang, “Deep takagi–sugeno–kang fuzzy classifier with shared linguistic fuzzy rules,” IEEE Transactions on Fuzzy Systems, vol. 26, no. 3, pp. 1535–1549, 2018.

[11] T. Zhou, H. Ishibuchi, and S. Wang, “Stacked-structure-based hierarchical takagi-sugeno-kang fuzzy classification through feature augmentation,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 1, no. 6, pp. 421–436, 2017.

[12] T. Zhou, F.-L. Chung, and S. Wang, “Deep tsk fuzzy classifier with stacked generalization and triplely concise interpretability guarantee for large data,” IEEE Transactions on Fuzzy Systems, vol. 25, no. 5, pp. 1207–1221, 2017.

[13] X. Gu and P. P. Angelov, “Semi-supervised deep rule-based approach for image classification,” Applied Soft Computing, vol. 68, pp. 53–68, 2018.

[14] M. Yeganejou and S. Dick, “Classification via deep fuzzy c-means clustering,” in 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6, IEEE, 2018.

[15] S. Riaz, A. Arshad, and L. Jiao, “A semi-supervised cnn with fuzzy rough c-mean for image classification,” IEEE Access, vol. 7, pp. 49641–49652, 2019.

[16] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al., “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[17] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” tech. rep., Citeseer, 2009.

[18] L. A. Zadeh, “On fuzzy algorithms,” in fuzzy sets, fuzzy logic, and fuzzy systems: selected papers By Lotfi A Zadeh, pp. 127–147, World Scientific, 1996.

[19] W. Pedrycz and F. Gomide, Fuzzy systems engineering: toward human-centric computing. John Wiley & Sons, 2007.

[20] P. Angelov and R. Yager, “A simple fuzzy rule-based system through vector membership and kernel-based granulation,” in 2010 5th IEEE International Conference Intelligent Systems, pp. 349–354, IEEE, 2010.

[21] A. I. Aviles, S. M. Alsaleh, E. Montseny, P. Sobrevilla, and A. Casals, “A deep-neuro-fuzzy approach for estimating the interaction forces in robotic surgery,” in 2016 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1113–1119, IEEE, 2016.

[22] C. P. Chen, C.-Y. Zhang, L. Chen, and M. Gan, “Fuzzy restricted boltzmann machine for the enhancement of deep learning,” IEEE Transactions on Fuzzy Systems, vol. 23, no. 6, pp. 2163–2173, 2015.

[23] Y.-J. Zheng, W.-G. Sheng, X.-M. Sun, and S.-Y. Chen, “Airline passenger profiling based on fuzzy deep machine learning,” IEEE transactions on neural networks and learning systems, vol. 28, no. 12, pp. 2911–2923, 2017.

[24] R. R. Yager, “Pythagorean fuzzy subsets,” in 2013 Joint IFSA World Congress and NAFIPS Annual Meeting (IFSA/NAFIPS), pp. 57–61, IEEE, 2013.

[25] A. K. Shukla, T. Seth, and P. K. Muhuri, “Interval type-2 fuzzy sets for enhanced learning in deep belief networks,” in 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp. 1–6, IEEE, 2017.

[26] J. M. Mendel and R. B. John, “Type-2 fuzzy sets made simple,” IEEE Transactions on fuzzy systems, vol. 10, no. 2, pp. 117–127, 2002.

[27] Y.-J. Zheng, S.-Y. Chen, Y. Xue, and J.-Y. Xue, “A pythagorean-type fuzzy deep denoising autoencoder for industrial accident early warning,” IEEE Transactions on Fuzzy Systems, vol. 25, no. 6, pp. 1561–1575, 2017.

[28] M. Pratama, W. Pedrycz, and G. I. Webb, “An incremental construction of deep neuro fuzzy system for continual learning of non-stationary data streams,” arXiv preprint arXiv:1808.08517, 2018.

[29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[30] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, 2016.

[31] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.

[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[33] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus, “Regularization of neural networks using dropconnect,” in International conference on machine learning, pp. 1058–1066, 2013.

[34] C. Xu, C. Lu, X. Liang, J. Gao, W. Zheng, T. Wang, and S. Yan, “Multi-loss regularized deep neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2273–2283, 2015.

[35] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.

[36] Y. Huang, Y. Cheng, D. Chen, H. Lee, J. Ngiam, Q. V. Le, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” arXiv preprint arXiv:1811.06965, 2018.

[37] F. Visin, K. Kastner, K. Cho, M. Matteucci, A. Courville, and Y. Bengio, “Renet: A recurrent neural network based alternative to convolutional networks,” arXiv preprint arXiv:1505.00393, 2015.

[38] N. Srivastava and R. R. Salakhutdinov, “Discriminative transfer learning with tree-based priors,” in Advances in Neural Information Processing Systems, pp. 2094–2102, 2013.

[39] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.