Run-time Deep Model Multiplexing

2020·Arxiv

Abstract

We propose a learning algorithm to design a light-weight neural multiplexer that, given the input and the computational resource requirements, calls the model that will consume the minimum compute resources for a successful inference. Mobile devices can use the proposed algorithm to offload hard inputs to the cloud while inferring easy ones locally. In addition, in large-scale cloud-based intelligent applications, instead of replicating the most accurate model, a range of small and large models can be multiplexed depending on the input’s complexity, which saves the cloud’s computational resources. The input complexity, or hardness, is determined by the number of models that can predict the correct label. For example, if no model can predict the label correctly, the input is considered the hardest. The proposed algorithm allows the mobile device to distinguish the inputs that can be processed locally from those that require a larger model and should be sent to a cloud server. The mobile user therefore benefits not only from local processing but also from an accurate model hosted on a cloud server. Our experimental results show that the proposed algorithm improves the mobile model’s accuracy by 8.52%, owing to the inputs that are properly selected and offloaded to the cloud server. In addition, it saves the cloud provider’s compute resources by a factor of 2.85, as small models are chosen for easier inputs.

Index Terms—deep neural network, resource-constrained inference, high-performance computing, privacy-preserving inference, edge intelligence, cloud intelligent services, collaborative intelligence, mobile cloud computing

I. INTRODUCTION

Deep learning is the rocket fuel of the recent advances in artificial intelligence and is gaining popularity in intelligent mobile applications, solving complex problems like object recognition [1, 2], facial recognition [3, 4], speech processing [5], and machine translation [6]. Although many of these tasks are important on mobile and embedded devices, especially for sensing and mission-critical applications such as health care and video surveillance, existing deep learning solutions often require powerful computational resources to run. Running these models on mobile devices can lead to long runtimes and the consumption of large amounts of resources, including CPU, memory, and power, even for simple tasks [7]–[9]. Besides the enhancements achieved in optimizing the computation graph, efficient storage access, such as Computational Storage Devices, has shown promising results in further accelerating deep learning models by reducing data movement from the storage device [10, 11].

The training process of deep neural networks (DNNs) is often offloaded to the cloud, as it requires a huge amount of computation on large data. Once the model is trained, it is used for inference on new, unseen inputs. The inference process can be hosted privately on local devices or as a public service in the cloud, which we call mobile-only and cloud-only inference, respectively. In cloud-only inference, the cloud providers grant access to the pre-trained models through an Application Programming Interface (API), which receives the input from the user and returns the inference results (predictions). Cloud-only inference is easy to deploy and scale up, but it compromises data privacy and needs a reliable network connection. The communication cost of cloud-based inference can also be larger than the computation cost of running a small model locally. On the other hand, mobile-only inference enables the mobile application to function without network access but is limited to small models due to the lack of sufficient computing resources.

Fig. 1: The percentage of ImageNet’s [12] validation set images that can be predicted correctly by a certain model but cannot be correctly predicted by another model. As an example, alexnet, our worst-performing model, can correctly predict 2.8% of the inputs that the largest model, resnext101 32x8d, cannot.

Recent promising advances in mobile-friendly deep architectures, such as mobilenet v2 [13], are closing the accuracy gap between mobile-level and cloud-level inference. For instance, the accuracies of mobilenet v2 as a mobile-scale model and resnext101 32x8d as a cloud-scale model are 73% and 79%, respectively. This essentially means that the mobile-level model can predict 73% of the inputs correctly on the device, while the cloud-level model can be called for the rest. As a result, a model multiplexer can be designed to call either the local model or the cloud model. However, the cost of this multiplexer should be kept small. We define input complexity, or easiness/hardness, as used throughout this paper:

• Given a pair of small (mobile-side) and large (cloud-side) models, an input is easy if its label can be predicted correctly by the small model. An input is hard if its label can be predicted correctly only by the large model.

• Given an ensemble of N models, the complexity of an input lies in a range between 0 and N, representing the number of models that fail to predict the input’s label correctly. In the extreme cases, the input complexity is 0 if all models predict correctly, and N if no model makes a correct prediction.

In cloud inference services, the best-performing model is replicated across the servers, and an API routes the users’ inputs to one of the hosting servers. However, as we discussed earlier, a large portion of the inputs can be predicted correctly by worse-performing models with fewer computations. A surprising fact is also that a small model can correctly predict some inputs that the largest model cannot. For example, as demonstrated in Figure 1, the worst-performing model, alexnet [14], correctly predicts 2.8% of the images that the best-performing model, resnext101 32x8d [15], cannot. This suggests that if the multiplexing is performed well, the accuracy can be even higher than that of the most accurate model. The proper selection of a model for inference can lead to considerable resource savings and higher accuracy.

In this paper, we present a model multiplexer that receives the raw input (e.g., an image) and outputs a binary vector that indicates the models capable of performing the inference. This multiplexer can be used on both mobile devices and cloud hosts. In a mobile application, the output of the multiplexer is a single binary value that decides whether the input should be processed locally or in the cloud. In a cloud service, instead of replicating the best-performing model, we can host a wide range of models with different computing requirements and choose among them depending on the complexity of the input. The multiplexer is a light-weight neural network that extracts the meta-features required to speculate on the correctness of the predictions of a set of models.
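The binary vector that the multiplexer is meant to predict, and the input complexity derived from it, can be illustrated with a minimal Python sketch. It follows the extreme cases stated in the definition (complexity 0 when every model is correct, N when none is). The function names and the example predictions are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def correctness_vector(predictions: Sequence[int], true_label: int) -> List[int]:
    """Return 1 for each model whose predicted label matches the true label, else 0."""
    return [int(p == true_label) for p in predictions]


def input_complexity(predictions: Sequence[int], true_label: int) -> int:
    """Count the models that miss the label: 0 is the easiest input, N the hardest."""
    return len(predictions) - sum(correctness_vector(predictions, true_label))


# Hypothetical predictions from N = 4 models for an input whose true label is 7.
preds = [7, 3, 7, 7]
vector = correctness_vector(preds, 7)
complexity = input_complexity(preds, 7)
```

Here `vector` is the kind of per-model target a multiplexer could be trained against, and `complexity` summarizes it as a single hardness score.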
We discuss related work below. Model compression techniques have been proposed to reduce the computational demand, often by trading off prediction accuracy. These techniques include quantization [16, 17], pruning [18], optimized convolution operations [13, 19, 20], and knowledge distillation for training small models using the knowledge of a teacher model [21]. Hardware-aware neural architecture search is also a recent and promising research area [22]. These approaches require the user to be expert enough to come up with a specific model that satisfies the prediction accuracy requirements. Our proposed methods for model multiplexing enable the user to automatically select the model that requires the least resources. Neurosurgeon [23] and JointDNN [24, 25] decide to offload some or all layers of a DNN from the mobile device to the cloud server to reduce latency and mobile energy consumption. Unlike JointDNN, our granularity level is a complete DNN, not a group of DNN layers. We seek to minimize the mobile inference latency by running small models on the mobile side and large models on the cloud side, depending on the hardness of the input. Offloading the inference task to the cloud adds the cost of communication over a network, which can be even larger than the computation cost. Besides, cloud-based inference compromises user privacy. However, if the mobile device can determine the input’s complexity, it can run the inference locally, as easy inputs can be solved by a small mobile-friendly DNN. Offloading the DNN inference computations to the cloud can reduce the inference time [26]; however, this is not always applicable because of privacy, communication latency, or connectivity issues. Another similar work [27] uses hand-crafted features, such as brightness or edge length in vision applications, to choose the best model among a group of models, which makes it highly dependent on the application domain. Furthermore, feature compression techniques have been proposed in prior art to reduce the cost of uploading the inputs to the cloud server [28]–[30].

Because the level of granularity in model multiplexing is a whole DNN, all acceleration techniques inside a DNN are complementary to our approach. Techniques such as convolutional kernel optimization [18, 31], task parallelism [32], and trading precision for time [33], to name but a few, are used to reduce the inference time. Since a single DNN is not likely to meet all the constraints, such as accuracy, latency, and energy consumption, across inputs, a strategy that dynamically selects the appropriate model appears to be a prudent option.

Our approach is also related to ensemble learning where multiple models are used to solve an optimization problem. This technique is shown to be useful on many cognition tasks [34]. However, in ensemble learning a voting mechanism (e.g. weighted mean) is used on all the models’ predictions while our approach only calls a single model.

Figure 2 illustrates the summary of four different scenarios that we addressed: (a) cloud-only inference where the input is always offloaded to the cloud, (b) mobile-only inference where the input is always processed locally, (c) mobile-cloud collaborative inference in which we choose between the mobile and cloud using the proposed multiplexer, (d) as the multiplexing can be done for more than two models, cloud API providers can also use the proposed algorithm to call smaller models instead of always calling the best-performing models. The paper makes the following contributions:

• We present a deep learning-based approach to automatically learn how to multiplex DNN models depending on the input complexity and computational resource requirements. We leverage multiple DNN models and their expertise domain to improve the prediction accuracy and reduce the floating-point operations (FLOPs) and latency.

• The proposed method adds little overhead for the multiplexing, as we use a small DNN acting as a pre-processor on the inputs. In return, it avoids calling the expensive large models while achieving higher accuracy.

• In the mobile inference, the proposed method enables the mobile devices to perform the easy inference tasks locally and offload the hard ones to the cloud server. Therefore, it preserves the privacy of users for the inputs that are detected as easy.

• In large-scale cloud intelligent services, instead of replicating the best-performing model, one can host a range of small and large models and select from them at run-time depending on the input’s complexity, which saves cloud resources by a factor of 2.85.

Fig. 2: Deep learning-powered mobile application deployment options. (a) and (b) show the status quo approaches of cloud-only and mobile-only inference. In (c) a model multiplexer is called on the input and decides whether the input can be classified correctly on-device or should be offloaded to the cloud due to its complexity. (d) demonstrates multiplexing among a set of models (more than two) in cloud intelligent service providers.

Fig. 3: The t-SNE visualization of the feature space of our benchmark models on the validation set of the ImageNet dataset. The feature spaces of correct and incorrect predictions overlap heavily. This overlap shows that predicting whether the prediction of a certain model will be correct is a hard task.

II. METHODOLOGY

In this section, we explain our proposed algorithm for model multiplexer design. Assume we are given N models to multiplex from. We use a very light-weight, mobile-friendly Convolutional Neural Network (CNN), consisting of 4 convolutional layers, which outputs N values in the range [0, 1]. The closer the ith value is to one, the more likely it is that the ith model can correctly predict the label.

The output of the layer before the final classification layer in a deep neural network is a vector referred to as an embedding. The embedding is the essential feature vector of the input learned by a neural network. Therefore, we expect the embeddings of different classes to be arranged in the space such that they are linearly separable. In Figure 3, we depict the projected embeddings of the inputs which are predicted correctly or incorrectly by six different deep model benchmarks. The projection from the high-dimensional space of embeddings into two-dimensional vectors is carried out using the t-SNE [35] dimensionality reduction algorithm. Figure 3 shows that there is no separation between the inputs which are predicted correctly and those predicted incorrectly by a certain model. As a result, using a pre-trained deep model for model multiplexing without any further supervision would be ineffective.

We propose a loss function, referred to as contrastive loss, for jointly training all the models we are multiplexing from. The intuition behind the contrastive loss is that, given two groups of models, if one group can predict the label of an input correctly and the other group cannot, the distance between their embeddings is increased. Also, when all models in a group can predict an input correctly, the distance between their embeddings is decreased. This loss function shapes the embedding space of the models similar to a Venn diagram. As depicted in Figure 4, for example, the red region on top contains the samples which can be predicted correctly only by Model 1, whereas the gray region in the center is the embedding space of the samples which are predicted correctly by all models. The proposed loss is inspired by the Pairwise Ranking Loss [36], in which the distance between the representations of samples is determined by the pairwise similarity of the samples.

Fig. 4: The target embedding space. The feature maps of the inputs are distributed in the space such that when a group of models can all predict the label of an input correctly, their embeddings are close to each other. Also, when a group of models can predict the label correctly while another group of models cannot, the distance between their embeddings is increased. This leads to a feature space similar to a Venn diagram. For instance, the red region on top shows the samples which can be predicted correctly only by Model 1.

Once the models are trained using the contrastive loss, we need to train the model multiplexer using our trained models. As we discussed earlier, given N models, the model multiplexer has N outputs, where the ith output shows the probability that the ith model can predict the input correctly. One advantage of using multiple models is that we can also leverage ensemble techniques. In an ensemble model, a subset of models is selected for the inference, and the mean of the selected models’ outputs is the final prediction. Our training procedure for the model multiplexer allows for selecting more than one model for ensembling purposes so as to increase the accuracy. The training procedures of both the CNN models using the contrastive loss and the model multiplexer are discussed in the following sections.

Fig. 5: Model multiplexer training procedure and its architecture. In the first step, the models we are multiplexing from are trained using the contrastive loss. The contrastive loss allows the learned embeddings to be grouped into regions where each region determines the expertise domain of a subset of models. In the second step, we distill the learned embeddings from the first step into the multiplexer by adding a distillation loss function. The multiplexer outputs a set of weights where each weight determines the confidence of its corresponding model about the prediction correctness. We also show where each loss function is applied in the figure.

A. Contrastive Loss Function

We seek to learn the features which are useful for extracting the domain expertise of a group of models. By expertise, we specifically mean the set of inputs that can be predicted correctly by a certain model. In practice, since the embedding vector sizes of the models can differ, we define $W_i$, which linearly transforms the embedding space of the ith model into a common dimension, and we further normalize the linearly transformed embeddings by their $\ell_2$ norm. We call this transformed space the projected embeddings. An embedding and a projected embedding of the ith model are denoted by $h_i$ and $z_i$, respectively:

$$z_i = \frac{W_i h_i}{\lVert W_i h_i \rVert_2} \qquad (1)$$

Given a pair of models, three cases can happen regarding their capability of correct prediction: 1- Both can predict correctly, in which case we decrease the distance between the projected embedding vectors. 2- One can predict correctly whereas the other cannot, in which case we increase the distance between the projected embedding vectors. 3- Neither can predict correctly, in which case we do not apply the contrastive loss and let the cross-entropy loss enable the models to learn the correct prediction without any interference from the contrastive loss. With this explanation, the contrastive loss function, $\mathcal{L}_{c}$, is of the form:

$$\mathcal{L}_{c}(\hat{y}, y) = \sum_{i<j} \ell_{ij}, \qquad \ell_{ij} = \begin{cases} d(z_i, z_j) & \text{if } \hat{y}_i = y \text{ and } \hat{y}_j = y \\ 1 - d(z_i, z_j) & \text{if exactly one of } \hat{y}_i, \hat{y}_j \text{ equals } y \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $y$ is the true label, $\hat{y}_i$ is the prediction of the ith model, and $d$ is a distance function. We may choose $d$ from any family of functions satisfying $d: \mathcal{Z}_i \times \mathcal{Z}_j \to [0, 1]$, where $\mathcal{Z}_i$ and $\mathcal{Z}_j$ are the embedding space domains. We use the cosine distance as the distance function:

$$d(z_i, z_j) = \frac{1}{2}\left(1 - \frac{z_i \cdot z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}\right) \qquad (3)$$

Other distance functions whose output range is normalized to [0, 1] can be used in this formulation; however, we performed the experiments using the cosine distance. We train all the models that we are multiplexing from by adding the contrastive loss to their main loss function, which is cross-entropy in our case. Step 1 of Figure 5 demonstrates the learning procedure with the contrastive loss, which is applied to all models in the ensemble.
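The three pairwise cases described above can be sketched in plain Python. This is a minimal illustration rather than the training code: it assumes a cosine distance rescaled to [0, 1], and it assumes that "increasing the distance" is implemented by minimizing one minus the distance. The function names and the example embeddings are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine distance rescaled to [0, 1] (assumed scaling; vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 - dot / norm)


def contrastive_loss(embeddings: Sequence[Sequence[float]], correct: Sequence[bool]) -> float:
    """Sum the pairwise terms over all model pairs for a single input."""
    loss = 0.0
    for i, j in combinations(range(len(embeddings)), 2):
        d = cosine_distance(embeddings[i], embeddings[j])
        if correct[i] and correct[j]:
            loss += d          # case 1: both correct, pull embeddings together
        elif correct[i] != correct[j]:
            loss += 1.0 - d    # case 2: exactly one correct, push embeddings apart
        # case 3: neither correct, no contrastive term
    return loss


# Hypothetical projected embeddings of three models for one input.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
correct = [True, True, False]
loss = contrastive_loss(embeddings, correct)
```

In a real training loop this term would be computed on tensors with automatic differentiation and added to each model's cross-entropy loss; the pure-Python version only makes the case analysis explicit.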

B. Learning the Model Multiplexer

Let $f_1, \dots, f_N : \mathcal{X} \to \mathcal{Y}$ denote the learned prediction functions of N deep learning models, where $\mathcal{X}$ and $\mathcal{Y}$ are the input space and target predictions, respectively. Similar to standard stacking [37], we seek to determine a weighted prediction function of the form:

$$F(x) = \sum_{i=1}^{N} w_i(x)\, f_i(x) \qquad (4)$$

where $w_i(x)$ is the ith model’s contribution to the final prediction. Let $g_i(x)$ represent the meta-feature extraction function for predicting the correctness of the ith model’s prediction, and let $c_i$ denote the computing cost of the ith model. The meta-features are supposed to capture the features necessary for determining the weights, which correspond to the likelihood that a certain model can make a correct prediction on the given input. We model $w_i$ as a linear function of the meta-features weighted by the inverse of the computing cost, which is FLOPs in our case:

$$w_i(x) = \frac{1}{c_i}\, v_i^{\top} g_i(x) \qquad (5)$$

where $v_i$ is a weight vector with the same dimension as $g_i(x)$. To squash the $w_i$ into the range [0, 1], we normalize them using the Softmax function. Under these assumptions, Equation 4 can be rewritten as:

$$F(x) = \sum_{i=1}^{N} \frac{\exp\left(v_i^{\top} g_i(x) / c_i\right)}{\sum_{j=1}^{N} \exp\left(v_j^{\top} g_j(x) / c_j\right)}\, f_i(x) \qquad (6)$$

We parameterize all $g_i$ with a convolutional neural network and denote its parameters by $\theta$. As a result, the learnable parameters are $\theta$ and $v = (v_1, \dots, v_N)$. This formulation leads to the following optimization problem:

$$\min_{\theta,\, v} \sum_{(x, y) \in X} \ell\big(F(x), y\big) \qquad (7)$$

where $X$ is the training set and $\ell$ is the prediction loss. We also add a distillation loss for distilling the projected embeddings of all models, learnt with the contrastive loss, into the multiplexer. We denote the projected embedding learnt by the ith model as $z_i$ and the ith meta-feature of the model multiplexer as $g_i$:

$$\mathcal{L}_{d} = \sum_{i=1}^{N} d\big(z_i, g_i(x)\big) \qquad (8)$$

where $d$ is the same function as in Equation 3.
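The weighting scheme described in this subsection (a per-model score divided by its compute cost, then normalized with Softmax and used to mix the models' outputs) can be sketched as follows. The per-model scores stand in for the linear function of the meta-features and are hypothetical values, as are the function names; the FLOPs figures are the mobilenet v2 and resnext101 32x8d costs in GFLOPs.

```python
import math
from typing import List, Sequence


def multiplexer_weights(scores: Sequence[float], costs: Sequence[float]) -> List[float]:
    """Softmax over per-model scores, each divided by that model's compute cost."""
    scaled = [s / c for s, c in zip(scores, costs)]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_prediction(weights: Sequence[float],
                        outputs: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of the per-model class-probability vectors."""
    num_classes = len(outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, outputs))
            for k in range(num_classes)]


# Hypothetical scores for a small and a large model; costs in GFLOPs.
scores = [0.3, 8.2]
costs = [0.299, 16.4]
weights = multiplexer_weights(scores, costs)
mixed = weighted_prediction(weights, [[0.7, 0.3], [0.1, 0.9]])
```

Dividing by the cost before the Softmax means a large model needs a proportionally higher raw score to win, which is how cheap models end up preferred when both are likely to be correct.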

Figure 5’s Step 2 demonstrates the proposed learning algorithm for training the model multiplexer. The complete training process of the models and the multiplexer is demonstrated in Algorithm 1.

C. Multiplexing process

We explained how to train the model multiplexer. The multiplexing can be performed in two ways: 1- We find the maximum weight and call the corresponding model to perform the inference. 2- We select all models whose corresponding weight is greater than a threshold and take the average of their outputs. The whole multiplexing process is shown in Algorithm 2.
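The two multiplexing modes described above can be sketched in a few lines of Python. This is an illustrative sketch, not the algorithm listing itself: the weights and model outputs are hypothetical, and the fallback to the single best model when no weight passes the threshold is an assumption added so the function always returns at least one model.

```python
from typing import List, Sequence


def select_single(weights: Sequence[float]) -> int:
    """Mode 1: index of the model with the largest multiplexer weight."""
    return max(range(len(weights)), key=lambda i: weights[i])


def select_ensemble(weights: Sequence[float], threshold: float) -> List[int]:
    """Mode 2: indices of all models whose weight exceeds the threshold."""
    chosen = [i for i, w in enumerate(weights) if w > threshold]
    # Assumption: fall back to the single best model if nothing passes.
    return chosen or [select_single(weights)]


def average_outputs(outputs: Sequence[Sequence[float]],
                    chosen: Sequence[int]) -> List[float]:
    """Element-wise mean of the selected models' output vectors."""
    return [sum(outputs[i][k] for i in chosen) / len(chosen)
            for k in range(len(outputs[0]))]


# Hypothetical multiplexer weights and per-model class probabilities.
weights = [0.05, 0.40, 0.30, 0.25]
outputs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.5, 0.5]]
single = select_single(weights)
ensemble = select_ensemble(weights, threshold=0.288)
prediction = average_outputs(outputs, ensemble)
```

Only the selected models need to be executed, so in mode 1 the cost of an inference is the multiplexer plus one model, while mode 2 trades extra compute for the averaging benefit of an ensemble.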

Algorithm 2 Multiplexing process

TABLE I: The latency, percentage of local inference, and accuracy of mobile-only, cloud-only and hybrid (multiplexing) methods. mobilenet v2 and resnext101 32x8d are used as the mobile and cloud deep models, respectively.

| Setup | FLOPs | Latency | Mobile Energy | Local | Acc. |
| --- | --- | --- | --- | --- | --- |
| Mobile-only | 299M | 3.53ms | 12mJ | 100% | 71.88% |

A. Experimental Setup

Hardware. We evaluate our approach on the NVIDIA Jetson TX2 embedded deep learning platform as our mobile device. The system has a 64-bit dual-core Denver2 and a 64-bit quad-core ARM Cortex-A57 running at 2.0 GHz, and a 256-core NVIDIA Pascal Graphics Processing Unit (GPU) running at 1.3 GHz. The board has 8 GB of LPDDR4 RAM and 96 GB of storage (32 GB eMMC plus 64 GB SD card). We use an NVIDIA GTX 1080Ti as our server-side hosting GPU. We measure the energy consumption of each component on the board using the INA226 power sensor. For the communication latency, we set the uplink and downlink rates to the average Wi-Fi uplink and downlink speeds in the United States [38].

System Software. Our evaluation platform runs Ubuntu 16.04 with Linux kernel v4.4.15. We use PyTorch [39], cuDNN (v7.0) and CUDA (v10.1).

Deep Learning Models. We consider six state-of-the-art CNN models for image recognition. The models are built using PyTorch and trained on the ImageNet ILSVRC 2012 [12] training set. The total number of floating-point operations required for a single inference is used as the computation cost of the model in Equation 5. We train all the benchmark models and the multiplexer model for 200 epochs on the training set of ImageNet.

B. Results

Mobile-cloud collaborative inference. In this scenario, one light-weight model (mobilenet v2) is hosted on the mobile side and the best-performing model (resnext101 32x8d) on the cloud side. The multiplexer is a 4-layer light-weight CNN that adds negligible computation cost compared to the mobile-hosted model. Our neural multiplexer outputs a single value between zero and one. Zero means the input should be classified on the mobile device, and one means the input should be classified on the cloud server. We apply a threshold at 0.5 to binarize the output, which decides whether to perform the inference on the mobile device or on the cloud server. Although a negligible extra computation is added to the mobile inference, it benefits the user with an accuracy improvement of about 8.5%, which comes from the inputs that can be classified correctly only by the cloud’s large and accurate model.

In order to give a clear understanding of the components of the latency and energy consumption, we provide their formulations. We refer to both latency and energy consumption as the cost, which is represented by $C$. The cost of a single inference using the mobile-only approach is due only to the computations required for the inference using the mobile-side model (mobilenet v2):

$$C_{\text{mobile-only}} = C_{\text{mobile}}^{\text{compute}} \qquad (9)$$

The latency and energy consumption of a single inference using the cloud-only approach consist of the communication cost and the cloud compute cost:

$$C_{\text{cloud-only}} = C_{\text{comm}} + C_{\text{cloud}}^{\text{compute}} \qquad (10)$$

The latency and energy consumption of a single inference using the hybrid approach have two possible cases, where $C_{\text{mux}}$ denotes the cost of running the multiplexer: 1- The multiplexer decides to perform the inference locally, in which case:

$$C_{\text{hybrid}}^{\text{local}} = C_{\text{mux}} + C_{\text{mobile}}^{\text{compute}} \qquad (11)$$

2- The multiplexer decides to perform the inference on the cloud, in which case:

$$C_{\text{hybrid}}^{\text{cloud}} = C_{\text{mux}} + C_{\text{comm}} + C_{\text{cloud}}^{\text{compute}} \qquad (12)$$

Therefore, the cost of the hybrid approach is the weighted average of the two previous equations. The weights are determined by the fractions of inferences that are performed on the mobile device, $p_{\text{local}}$, and on the cloud, $1 - p_{\text{local}}$. The hybrid approach’s cost is:

$$C_{\text{hybrid}} = p_{\text{local}}\, C_{\text{hybrid}}^{\text{local}} + \left(1 - p_{\text{local}}\right) C_{\text{hybrid}}^{\text{cloud}} \qquad (13)$$
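The weighted-average cost of the hybrid approach can be sketched numerically. In the snippet below, the 68% local fraction and the 3.53 ms mobile latency come from the paper, the 11.8 ms cloud latency is the resnext101 32x8d figure from Table II, and the multiplexer and communication costs are hypothetical placeholders, since their measured values are not stated here. The explicit multiplexer term is likewise an assumption of this sketch.

```python
def hybrid_cost(p_local: float, c_mux: float, c_mobile: float,
                c_comm: float, c_cloud: float) -> float:
    """Expected per-inference cost (latency or energy) of the hybrid approach."""
    local_case = c_mux + c_mobile            # multiplexer keeps the input on-device
    cloud_case = c_mux + c_comm + c_cloud    # multiplexer offloads the input
    return p_local * local_case + (1.0 - p_local) * cloud_case


# Latency example in milliseconds; c_mux and c_comm are hypothetical values.
latency_ms = hybrid_cost(p_local=0.68, c_mux=0.1, c_mobile=3.53,
                         c_comm=20.0, c_cloud=11.8)
```

Because the communication term only appears in the offloaded case, lowering the offloaded fraction is the most direct way to reduce the expected cost whenever communication dominates.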

Detailed results for the collaborative inference between the mobile device and the cloud server are shown in Table I. As it shows, the multiplexer decides to process 68% of the inputs locally on the mobile device, while the other 32% are offloaded to the cloud. Our algorithm also improves the accuracy of the mobile-only approach by 8.5%, owing to the correct predictions on the inputs that are offloaded to the cloud. The accuracy of the hybrid approach is even higher than that of the cloud model, because the small model can make correct predictions on some inputs that the large model cannot. The True Negative Rate of the multiplexer is the detection rate of the inputs that can be classified correctly by the mobile device, which is 0.966 in our case. This means we miss (1 − 0.966) × 0.7188 = 2.4% of the inputs that could be predicted correctly by the mobile device, which is compensated for by the powerful cloud model. The latency and energy of the hybrid approach in Table I are worse than those of the mobile-only approach, but this comparison is not fair, because the extra latency and energy cost we pay directly increases the accuracy. Neglecting the cost of multiplexing, the extra latency and energy have two sources: 1- The inputs that could be predicted correctly on the mobile device but are offloaded to the cloud, which is the case for only 2.4% of the inputs; 2- The inputs that could not be predicted correctly on the mobile device and are offloaded to the cloud, which is the case for 32% − 2.4% = 29.6% of the inputs and is the dominant component.
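The breakdown of the offloaded inputs follows from three reported numbers and can be checked with a short calculation. The variable names below are illustrative only.

```python
# Reported values: multiplexer true negative rate, mobile-only accuracy,
# and the fraction of inputs offloaded to the cloud.
true_negative_rate = 0.966
mobile_accuracy = 0.7188
offloaded_fraction = 0.32

# Inputs the mobile model would have classified correctly but that were offloaded.
missed_local = (1.0 - true_negative_rate) * mobile_accuracy

# Remaining offloaded inputs: those the mobile model could not classify correctly.
needed_cloud = offloaded_fraction - missed_local
```

Rounded to one decimal place, `missed_local` is 2.4% and `needed_cloud` is 29.6% of all inputs, matching the figures quoted in the text.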

TABLE II: The FLOPs, latency, and accuracy of six state-of-the-art CNN models. The Called column shows the percentage of inputs which are decided to be predicted by the corresponding model.

| Model | FLOPs | Latency | Accuracy | Called |
| --- | --- | --- | --- | --- |
| alexnet [14] | 655M | 6.8ms | 56.55% | 10.56% |
| mobilenet v2 [13] | 299M | 3.0ms | 71.88% | 18.80% |
| mnasnet1 0 [22] | 313M | 5.5ms | 73.45% | 21.80% |
| resnet50 [1] | 4.08G | 8.9ms | 76.15% | 14.80% |
| resnet152 [1] | 11.5G | 11.3ms | 78.31% | 15.80% |
| resnext101 32x8d [15] | 16.4G | 11.8ms | 79.31% | 18.24% |

Cloud-based API inference. In cloud-hosted inference services, the best-performing model is replicated on the servers, even though many inputs are easy and can be processed with small models. The proposed algorithm helps to route each easy or hard input to the model that will consume the minimum resources. Table II demonstrates the improvements we could achieve for cloud providers. Hybrid-single represents the scenario in which we multiplex a single model from a group of models, while hybrid-ensemble represents the scenario in which we multiplex more than one model from a group of models. The models whose associated weight in Equation 6 is greater than a threshold are selected to perform the inference. We swept over all possible values for the threshold and found 0.288 to be the best value, giving the maximum accuracy. Similarly, we also show the cost equation for the hybrid approach of cloud-based inference:

$$C_{\text{hybrid}}^{\text{cloud-API}} = \sum_{i=1}^{N} p_i\, C_i \qquad (14)$$

where $p_i$ is the percentage of the times that the ith model is called, and $C_i$ represents the cost of running the ith model on the cloud. In the hybrid-single case, the FLOPs count is reduced from 16.4G (i.e., the FLOPs of the largest model) to 5.75G, which essentially saves GPU resources by a factor of 2.85. The latency is reduced by 34.5% and the overall accuracy is improved by 4.55%. In addition, if we use more than one model after multiplexing, i.e., ensembling the models, we can further improve the accuracy. Although ensembling increases the FLOPs, we exploit the fact that model ensembles can be parallelized on GPUs. As a result, the increase (%) in the latency of hybrid-ensemble is less than the increase (%) in its FLOPs.
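The call-weighted cost described above can be evaluated directly on the Table II values. The sketch below is a rough check rather than a reproduction of the reported numbers: it uses the rounded table entries and ignores the multiplexer's own cost. The dictionary layout and function name are illustrative.

```python
from typing import Sequence

# Table II values per model: (GFLOPs, latency in ms, fraction of inputs routed to it).
TABLE_II = {
    "alexnet": (0.655, 6.8, 0.1056),
    "mobilenet v2": (0.299, 3.0, 0.1880),
    "mnasnet1 0": (0.313, 5.5, 0.2180),
    "resnet50": (4.08, 8.9, 0.1480),
    "resnet152": (11.5, 11.3, 0.1580),
    "resnext101 32x8d": (16.4, 11.8, 0.1824),
}


def expected_cost(costs: Sequence[float], call_fractions: Sequence[float]) -> float:
    """Average cost per inference when model i is called with fraction p_i."""
    return sum(p * c for p, c in zip(call_fractions, costs))


flops = [row[0] for row in TABLE_II.values()]
latencies = [row[1] for row in TABLE_II.values()]
fractions = [row[2] for row in TABLE_II.values()]

expected_gflops = expected_cost(flops, fractions)
expected_latency_ms = expected_cost(latencies, fractions)

# Relative latency saving versus always calling the largest model (11.8 ms).
latency_reduction = 1.0 - expected_latency_ms / 11.8
```

With these inputs the expected cost comes to roughly 5.6 GFLOPs and the latency reduction to roughly 34%, in line with the reported 5.75G and 34.5%; the small FLOPs gap is not explained by this sketch.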

We demonstrate the effectiveness of the contrastive loss in Figure 6. The learned embedding space is similar to our target Venn diagram style depicted in Figure 4. The inputs which are only in the expertise domain of a certain model are pushed to the boundaries and the inputs which can be predicted correctly by multiple models are closer to the center. The separable embedding space that we create enables a light-weight neural multiplexer to effectively learn the multiplexing function.

IV. CONCLUSION AND FUTURE WORK

In this paper, we present an algorithm that selects which deep learning model to use depending on the input complexity and resource budgets. With the proposed algorithm, mobile devices can host a small, mobile-friendly model and detect the inputs that are likely to be predicted correctly by local inference. Mobile devices offload the inputs that they find hard to the cloud servers, to be inferred by the larger models hosted in the cloud. The communication cost of cloud-based inference dominates the local inference computation cost. As a result, it is desirable to offload as little as possible to the cloud while still meeting the accuracy requirements. Our results show that a user only needs to offload 32% of the inputs to the cloud while achieving an accuracy even higher than that of the cloud-hosted model. Furthermore, cloud providers offering APIs for cognitive tasks replicate their best-performing model on their servers, to be called for any input regardless of its level of complexity. With our approach, however, they can host a wide range of small and large models and choose one depending on the input. This saves the cloud provider’s compute resources by a factor of 2.85 while improving the accuracy by 4.55% compared to deploying the most accurate model.

Fig. 6: The t-SNE visualization of the feature space of the validation set of the ImageNet dataset for the benchmark models trained using the proposed loss function. Left: mobile-cloud collaborative inference using mobilenet v2 on the mobile side and resnext101 32x8d on the cloud side. Right: Ensemble of six benchmark CNNs, which is suitable for cloud-based intelligent services that host replicas of the most accurate model. For instance, instead of replicating resnext101 32x8d on six different servers, one can host these six CNNs plus the multiplexer, which achieves lower compute resource usage and higher accuracy.

REFERENCES

[1] K. He, X. Zhang et al., “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2015.

[2] J. Donahue, Y. Jia et al., “Decaf: A deep convolutional activation feature for generic visual recognition,” ArXiv, vol. abs/1310.1531, 2013.

[3] O. M. Parkhi, A. Vedaldi et al., “Deep face recognition,” in BMVC, 2015.

[4] Y. Sun, Y. Chen et al., “Deep learning face representation by joint identification-verification,” in NIPS, 2014.

[5] D. Amodei, S. Ananthanarayanan et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in ICML, 2015.

[6] D. Bahdanau, K. Cho et al., “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.

[7] A. Canziani, A. Paszke et al., “An analysis of deep neural network models for practical applications,” ArXiv, vol. abs/1605.07678, 2017.

[8] M. Samragh, M. Javaheripi et al., “Encodeep: Realizing bit-flexible encoding for deep neural networks,” ACM Trans. Embed. Comput. Syst.

[9] M. S. Abrishami, A. E. Eshratifar et al., “Efficient training of deep convolutional neural networks by augmentation in embedding space,” in 2020 21st International Symposium on Quality Electronic Design (ISQED), 2020, pp. 347–351.

[10] A. HeydariGorji, S. Rezaei et al., “Hypertune: Dynamic hyperparameter tuning for efficient distribution of dnn training over heterogeneous systems,” ArXiv, vol. abs/2007.08077, 2020.

[11] A. HeydariGorji, M. Torabzadehkashi et al., “Stannis: Low-power acceleration of deep neural network training using computational storage,” ArXiv, vol. abs/2002.07215, 2020.

[12] J. Deng, W. Dong et al., “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009.

[13] A. G. Howard, M. Zhu et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” ArXiv, vol. abs/1704.04861, 2017.

[14] A. Krizhevsky, I. Sutskever et al., “Imagenet classification with deep convolutional neural networks,” NIPS, 2012.

[15] S. Xie, R. B. Girshick et al., “Aggregated residual transformations for deep neural networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5987–5995, 2017.

[16] S. Han, J. Pool et al., “Learning both weights and connections for efficient neural network,” NIPS, 2015.

[17] M. Rastegari, V. Ordonez et al., “Xnor-net: Imagenet classification using binary convolutional neural networks,” in ECCV, 2016.

[18] S. Han, X. Liu et al., “Eie: Efficient inference engine on compressed deep neural network,” ISCA, 2016.

[19] F. N. Iandola, M. W. Moskewicz et al., “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and 1mb model size,” ArXiv, vol. abs/1602.07360, 2017.

[20] P. Georgiev, S. Bhattacharya et al., “Low-resource multi-task audio sensing for mobile and embedded devices via shared deep neural network representations,” IMWUT, 2017.

[21] G. E. Hinton, O. Vinyals et al., “Distilling the knowledge in a neural network,” ArXiv, vol. abs/1503.02531, 2015.

[22] M. Tan, B. Chen et al., “Mnasnet: Platform-aware neural architecture search for mobile,” CVPR, 2018.

[23] Y. Kang, J. Hauswald et al., “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in ASPLOS, 2017.

[24] A. E. Eshratifar, M. S. Abrishami et al., “Jointdnn: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Transactions on Mobile Computing, pp. 1–1, 2019.

[25] A. E. Eshratifar and M. Pedram, “Energy and performance efficient computation offloading for deep neural networks in a mobile cloud computing environment,” in Proceedings of the 2018 on Great Lakes Symposium on VLSI, 2018, pp. 111–116.

[26] S. Teerapittayanon, B. McDanel et al., “Distributed deep neural networks over the cloud, the edge and end devices,” 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), 2017.

[27] B. Taylor, V. S. Marco et al., “Adaptive deep learning model selection on embedded systems,” in LCTES, 2018.

[28] A. E. Eshratifar, A. Esmaili et al., “Bottlenet: A deep learning architecture for intelligent mobile cloud computing services,” in 2019 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), July 2019, pp. 1–6.

[29] A. E. Eshratifar, A. Esmaili et al., “Towards collaborative intelligence friendly architectures for deep learning,” in 20th International Symposium on Quality Electronic Design (ISQED), March 2019, pp. 14–19.

[30] H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP), Oct 2018, pp. 3743–3747.

[31] S. Bhattacharya and N. D. Lane, “Sparsification and separation of deep learning layers for constrained resource inference on wearables,” in SenSys, 2016.

[32] N. D. Lane, S. Bhattacharya et al., “Deepx: A software accelerator for low-power deep learning inference on mobile devices,” 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), 2016.

[33] H. N. Loc, Y. Lee et al., “Deepmon: Mobile gpu-based deep learning framework for continuous vision applications,” in MobiSys, 2017.

[34] O. Sagi and L. Rokach, “Ensemble learning: A survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2018.

[35] L. van der Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, 2008.

[36] W. Chen, T.-Y. Liu et al., “Ranking measures and loss functions in learning to rank,” in NIPS, 2009.

[37] L. Breiman, “Stacked regressions,” Machine Learning, vol. 24, no. 1, pp. 49–64, Jul 1996.

[38] Ookla, “2019 Speedtest U.S. Mobile Performance Report,” https://www.speedtest.net/reports/united-states/, 2019.

[39] A. Paszke, S. Gross et al., “Automatic differentiation in pytorch,” 2017.