Identifying malicious software (malware) on a host machine is a critical task in maintaining a system’s integrity and the integrity of the work performed on that system. Intrusion detection systems (IDSs), such as anti-virus software, are used to identify, assess, and report any unauthorized programs on a system. Malware authors use various techniques to evade detection by an IDS, such as changing which registers are used, replacing machine instructions with equivalent ones, reordering independent instruction blocks, and inserting no-operation instructions [23]. Signature-based approaches used by many IDSs are static and cannot adapt to the dynamic approaches implemented in malware, other than by repeatedly adding new signatures manually. As the number of computing devices increases, especially those used to process sensitive tasks (e.g., banking, health care, and infrastructure), it is imperative to detect compromised systems as soon as possible.
Despite exploiting different vulnerabilities and employing obfuscation techniques, most malware exhibits common behavior. For example, once a system is exploited, malware will often beacon out to a command and control server, clean up log files to cover its tracks, or perform a similar activity, as depicted in the cyber kill chain for advanced persistent threats [8]. Beyond the behavior of malware itself, it is also important to consider the implications of concept drift, wherein a target distribution (in this case, the behavior of malware) is non-stationary and changes over time.
We examine machine learning (ML) algorithms, namely random forests (RFs), deep learning approaches, and liquid state machines (LSMs), to detect malicious behavior using system call traces (calls to functions provided by the operating system; see Figure 1). We examine these methods using a concept drift scenario, whereby training data precedes test data collected from a corporate gateway. We observe that these algorithms are able to distinguish between malware and goodware with an average class-averaged accuracy (CAA) of 93%, with a malware precision of 93% and a malware recall of 88%.
ML for identifying malware has several unique challenges. Explicitly, malware authors actively try to masquerade their malware as goodware. There is also a scaling issue in the sheer number of data samples, the size of each data sample, and class imbalance. In operational networks of large corporations, large numbers of executables are observed across the network, with less than 1% of them being malware. We structure experiments to investigate these issues and offer insights from traditional training schemes.
Figure 1: Process for generating the system call data. a) System calls are functions called from an executable to the underlying operating system. b) The system calls are intercepted and saved in a file and then converted to a multi-hot encoding (MHE) of the system call traces.
An induced model captures the generalizations of the underlying data that it is modeling. We explore the use of ML explainability techniques to understand the characteristics of malware and identify behaviors that lead to its classification as malware.
Our contributions include: 1) a comparison of several ML models in malware identification using system call traces on real-world data, 2) an analysis of malware including which features are the most important for identifying malware, and 3) practical insights in how to apply these results in real-world scenarios and the implications of using each technique.
Several previous studies have shown promising results using ML algorithms to detect malware [16] or to differentiate between different families of malware [9]. There are two common approaches to extract features from malware: static analysis and dynamic analysis [6]. Static analysis refers to extracting statistics from the meta-information of an executable without running the executable, such as a list of DLLs in the binary [19] or byte n-grams [10]. Features from static analysis are vulnerable to obfuscation techniques such as code transformation techniques, but have the advantage that malware never has to be executed. Feeding these features into various ML algorithms produced good results showing the benefits of using ML to detect malware. Despite this success, Kruegel et al. showed that advanced, semantic-based malware detectors can be evaded using obfuscation techniques commonly employed by malware authors. They concluded that static analysis techniques alone are not sufficient to identify malware [11].
Dynamic analysis, on the other hand, runs an executable to extract features. Dynamic analysis, in principle, should be less vulnerable to obfuscation as it extracts features from the behavior of an executable. Several previous works used Markov chains to model system call sequences. Ravi and Manoharan use third-order Markov chains to model the system call traces and achieve better detection rates than support vector machines, decision trees, and naïve Bayes [17]. Anderson et al. extract Markov chains of the instruction traces as features. Graph kernels are then used to create a similarity matrix which is then passed to a support vector machine for classification [3].
The success of neural networks in other application areas has motivated the use of neural methods to classify malware. For example, Nataraj et al. treat executables as gray-scale images and use well proven image processing techniques to classify malware [14]. Tobiyama et al. use long short-term memory neurons to learn a language model of the malware based on their system calls using unsupervised learning [22]. The output from this learned language model is then fed into a convolutional neural network for classification. These works report accuracies of 96-98%; however they only experimented using cross-validation, discarding any temporal ordering and not validating how the models handle concept drift. Our work builds on the success of the previous works, provides a comparison of several methods, and examines how the investigated methods handle concept drift.
System calls are the standardized programmatic pathways that allow programs to interact with the operating system. Programs use system calls to request and manipulate computer resources controlled by the operating system, such as files, memory, and network connections. The data used for this analysis is a sequence of system calls made to the operating system from a given executable; Figure 1a shows this process. The executables were gathered from two sources. The benign executables were pulled from the gateway of a corporate network under the assumption that the majority of the downloaded executables are benign. These executables were run through several anti-virus programs and, if none hit, the sample was considered benign. Of course, it is possible that some of the collected executables labeled benign are in fact malware. This is consistent with real-world situations where labels are neither precise nor available for all samples. The malware samples were gathered from daily feeds from Arbor Networks, a cybersecurity company that maintains a repository of malware.
The data was collected over the course of 2012. All samples are Windows 32-bit executables. Each sample was executed in a hypervisor environment (a platform for running virtual machines) for a given period of time and the system calls made by the top-level process were collected. In total, 14,483 samples were collected from 6,197 benign executables and 8,286 malware samples. In our analysis, we only analyzed the original, primary process, ignoring any spawned processes. Future work will include an analysis of the spawned processes in addition to the parameters that are passed to the system calls.
Generally, complexity for the sequence learning methods increases with the length of the system call trace. We limit our analysis to look at the first n system calls made by an executable. Most analyses are done on the first 1000 system calls. An analysis of the length of the system call traces is provided in Section 6.1.
As multiple system calls can occur in a time step (the granularity of each time step is one millisecond), we use a multi-hot encoding scheme as depicted in Figure 1b. In multi-hot encoding, a vector whose length is the number of unique system calls is initialized with all zeros. The number of times each system call was made during a time step is put into the vector at the index representing the given system call. With this base data set, we allow each algorithm to further process the data to highlight its strengths.
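The multi-hot encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `multi_hot_encode`, the `(time_step, syscall_id)` event format, and the choice to skip empty time steps are all assumptions made for the example.

```python
from collections import defaultdict

def multi_hot_encode(events, num_syscalls):
    """Turn (time_step, syscall_id) events into one count vector per time step.

    Assumed input format: time_step is an integer millisecond bucket and
    syscall_id is an index in [0, num_syscalls).
    """
    counts_by_step = defaultdict(lambda: [0] * num_syscalls)
    for time_step, syscall_id in events:
        # Count how often each system call occurs within the time step.
        counts_by_step[time_step][syscall_id] += 1
    # Return the vectors in time order; empty time steps are skipped here.
    return [counts_by_step[t] for t in sorted(counts_by_step)]

# Hypothetical trace: call 2 twice and call 0 once at step 0, call 1 at step 3.
events = [(0, 2), (0, 2), (0, 0), (3, 1)]
encoded = multi_hot_encode(events, num_syscalls=4)
print(encoded)
```

Each row has one slot per unique system call, so the row for step 0 holds a count of 2 at index 2 and a count of 1 at index 0.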
4.1 Histograms and Random Forests (Hist+RF) One way to encode the time series data for learning is to use histograms. Each feature represents a particular system call and the value corresponds to the number of times that call was made for a given system call trace. With this encoding, any supervised ML algorithm can be applied to the resulting dataset. We use this encoding as input to a random forest (RF) [4].
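As a rough illustration of the histogram encoding, the sketch below counts how often each system call appears in a trace. The vocabulary and the example trace are illustrative only, and `histogram_features` is a hypothetical helper name, not something taken from the paper.

```python
from collections import Counter

def histogram_features(trace, vocabulary):
    """Map a system call trace to a fixed-length vector of call counts.

    `vocabulary` fixes the feature order; calls outside it are ignored.
    """
    counts = Counter(trace)
    return [counts.get(name, 0) for name in vocabulary]

# Illustrative vocabulary and trace (not taken from the data set).
vocabulary = ["NtCreateFile", "NtReadFile", "NtClose"]
trace = ["NtCreateFile", "NtReadFile", "NtReadFile", "NtClose"]
features = histogram_features(trace, vocabulary)
print(features)
```

Rows built this way, one per executable, could then be passed to any supervised learner, for example scikit-learn's `RandomForestClassifier`.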
This encoding does not capture the temporal information in the data (the order in which system calls were made). It can be modified to encode some ordering by using n-grams, where each feature becomes a unique sequence of n system calls present in the data. However, the n-gram encoding increases both the dimensionality and the sparseness of the data as n is increased, which can make learning more difficult. We do not examine the use of n-grams here.
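Although n-grams are not used in this work, a small sketch may help show why the dimensionality grows. The trace below uses placeholder call names, and `ngram_counts` is a hypothetical helper; with a vocabulary of V distinct calls there can be up to V^n distinct n-gram features.

```python
from collections import Counter

def ngram_counts(trace, n):
    """Count every contiguous run of n system calls in a trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

# Placeholder call names; a real trace would use system call identifiers.
trace = ["A", "B", "A", "B", "C"]
bigrams = ngram_counts(trace, 2)

vocabulary_size = len(set(trace))
max_bigram_features = vocabulary_size ** 2  # upper bound on distinct 2-grams
print(dict(bigrams), max_bigram_features)
```

Only three of the nine possible bigrams occur in this trace, which is the sparseness the text refers to.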
4.2 Deep Learning Methods Deep neural networks have been shown to be effective for pattern recognition in images, speech, and text. We examine the use of convolutional neural networks (CNNs) and long short-term memory (LSTM) recurrent neural networks (RNNs). CNNs are essential elements of state-of-the-art systems for classifying objects in images and LSTMs are used in state-of-the-art text/natural language processing applications. CNNs and LSTMs are used together in state-of-the-art speech recognition systems to capture the sequence of spectral patterns over time from speech data [2]. We examine the effectiveness of CNNs, LSTMs and a combination of the two for detecting malware from system call traces.
4.2.1 Convolutional Neural Networks (CNN)
A convolutional layer in a deep neural network learns patterns of local structure in the input signal. Subsequent convolutional layers learn combinations of features detected in previous layers. Thus, the CNNs can learn feature representations over a sequence of input data [12]. The final layer of the CNN classifies an input sequence as goodware or malware as a function of the high-level system call structure detected over the entire sequence. The system calls are translated into integer values in a one-dimensional vector and are then processed by a CNN with one-dimensional convolutional layers.
We tested three different architectures with a variety of kernel sizes. Each architecture ran for 60 epochs with kernels of five, seven, and ten elements. All networks used categorical cross entropy as the loss function and used 32, 64, or 128 filters. We used 1) a CNN with two, one-dimensional convolutional layers followed by pooling, dropout, and dense layers, 2) a pure CNN with five convolutional layers, and 3) a hybrid CNN with five convolutional layers separated by batch normalization layers. From empirical analyses, the hybrid convolutional network had the best performance and fastest training. Only the results from the hybrid model are reported.
4.2.2 Long Short-Term Memory (LSTM) For the LSTM [7], each system call trace includes a time-ordered sequence of system call IDs on which an LSTM network can detect temporal (sequence) patterns that are important for discriminating between goodware and malware. One and two layers of LSTM neurons with various numbers of neurons per layer were explored. Each node learns a different sequence pattern and the collection of sequence pattern detectors from all the nodes connected to the output layer is used to classify each system call log.
4.2.3 Combined Convolutional and Recurrent Neural Network (CNN+LSTM) Combining layers of a CNN with LSTM layer(s) has been shown to be a powerful classifier that learns temporal patterns in sequences of local structure. Convolutional layers can present a sequence of higher-level features to an LSTM layer, which often leads to better performance than an LSTM presented with raw sequence data. Thus, we examine using a convolutional layer before feeding the input into one or two LSTM layers.
4.3 Liquid State Machines (LSM) The LSM [13] is a neural-inspired algorithm that mimics the cortical columns in the brain. LSMs are composed of three general components: 1) input neurons, 2) randomly connected leaky integrate-and-fire (LIF) spiking neurons called the liquid, and 3) readout nodes that read the state of the liquid. The liquid functions as a temporal kernel, casting the input data into a higher dimension, and the LIF neurons allow for temporal state to be carried from one time step to another. We use a liquid of 135 neurons where the inputs are randomly connected to 30% of the neurons in the liquid.
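To make the role of the liquid more concrete, the following is a heavily simplified sketch of a reservoir of leaky integrate-and-fire neurons. Only the liquid size (135) and the 30% input connectivity come from the text. The weight ranges, leak factor, threshold, recurrent connection probability, the per-channel interpretation of the 30% connectivity, and the use of spike counts as the liquid state are all assumptions chosen for illustration.

```python
import random

def run_liquid(inputs, n_neurons=135, input_fraction=0.3, recurrent_prob=0.1,
               leak=0.9, threshold=1.0, seed=0):
    """Drive a liquid of LIF neurons and return each neuron's spike count.

    `inputs` is a non-empty list of input vectors, one per time step.
    All numeric defaults except n_neurons and input_fraction are assumed.
    """
    rng = random.Random(seed)
    n_inputs = len(inputs[0])
    k = round(input_fraction * n_neurons)
    # Assumption: each input channel projects to a random 30% of the liquid.
    w_in = [[0.0] * n_inputs for _ in range(n_neurons)]
    for j in range(n_inputs):
        for i in rng.sample(range(n_neurons), k):
            w_in[i][j] = rng.uniform(0.5, 1.0)
    # Sparse random recurrent weights; these stay fixed and are never trained.
    w_rec = [[rng.uniform(-0.5, 0.5) if rng.random() < recurrent_prob else 0.0
              for _ in range(n_neurons)] for _ in range(n_neurons)]
    potential = [0.0] * n_neurons
    spikes = [0] * n_neurons
    counts = [0] * n_neurons
    for x in inputs:
        new_spikes = []
        for i in range(n_neurons):
            drive = sum(w * xj for w, xj in zip(w_in[i], x))
            drive += sum(w * s for w, s in zip(w_rec[i], spikes))
            v = leak * potential[i] + drive  # leaky integration of the input
            if v >= threshold:
                new_spikes.append(1)
                potential[i] = 0.0  # reset the membrane potential after a spike
            else:
                new_spikes.append(0)
                potential[i] = v
        spikes = new_spikes
        counts = [c + s for c, s in zip(counts, spikes)]
    return counts

# Two time steps of hypothetical multi-hot input with four system calls.
liquid_state = run_liquid([[1, 0, 2, 0], [0, 1, 0, 0]])
print(len(liquid_state), sum(liquid_state))
```

A vector such as `liquid_state` is what a readout classifier would be trained on; the liquid itself is left untrained.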
The readout neurons are the only neurons that have plastic synapses, allowing for synaptic weight updates via training. Any classifier can be used, but often a linear classifier is sufficient. We use a support vector machine with a radial basis function kernel to train the readout neurons. The sigma and box parameters for the kernel are chosen using Bayesian optimization minimizing the 10-fold cross validation loss.
One benefit of using a liquid state machine is that it can be run on neuromorphic hardware, which will significantly reduce the computational time and power consumption [20].
To evaluate each algorithm described in Section 4, we report the accuracy (Acc), the class averaged accuracy (CAA), and, for the malware class, the precision (MPr) and recall (MRe). The CAA is the average of the accuracy for each class. We split the data into training and test sets maintaining temporal ordering (the instances in the training set were observed before the instances in the test set). The temporal ordering allows for a test of how well the models handle concept drift. The distributions for the goodware and malware are shown in the sorted column in Table 1. The distributions in the distributed column are used in a later experiment mimicking operational network traffic characteristics.
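The four reported metrics can be computed from the confusion counts as sketched below. The label convention (1 for malware, 0 for goodware), the function name `malware_metrics`, and the example predictions are assumptions for illustration; the sketch also assumes both classes are present in the test set.

```python
def malware_metrics(y_true, y_pred, malware=1, goodware=0):
    """Return accuracy, class-averaged accuracy, malware precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == malware and p == malware)
    fn = sum(1 for t, p in pairs if t == malware and p == goodware)
    fp = sum(1 for t, p in pairs if t == goodware and p == malware)
    tn = sum(1 for t, p in pairs if t == goodware and p == goodware)
    malware_recall = tp / (tp + fn)      # accuracy on the malware class
    goodware_accuracy = tn / (tn + fp)   # accuracy on the goodware class
    return {
        "Acc": (tp + tn) / len(pairs),
        "CAA": (malware_recall + goodware_accuracy) / 2,
        "MPr": tp / (tp + fp) if (tp + fp) else 0.0,
        "MRe": malware_recall,
    }

# Hypothetical labels: three malware and five goodware samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = malware_metrics(y_true, y_pred)
print(metrics)
```

Because CAA averages the per-class accuracies, it is not dominated by the majority class the way plain accuracy is.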
In current IDSs, the number of false positives can and often does overwhelm an analyst. The analyst often has to manually investigate any alerts from the IDS to understand if a compromise has occurred. Thus, reducing the number of false positives by an order of magnitude significantly reduces the work load for an analyst. Also, intrusion detection requires high recall as only one vulnerability needs to be exploited for an adversary to accomplish his or her objective. Thus, the requirements for malware detection reach far beyond classification accuracy.
Table 1: The data distributions used in the distributed (down-sampled) and sorted data sets. Both of these sets preserve temporal ordering between the training and test sets.
Table 2: The accuracy (Acc), class averaged accuracy (CAA), malware precision (MPr) and malware recall (MRe) on sequence lengths of 1000. The highest value for each metric is bolded.
The results for examining the first 1000 system calls are shown in Table 2 with the highest performing methods highlighted in bold for each metric. Each method is able to distinguish between malware and goodware with 90% CAA or greater. Overall, the Hist+RF achieves the highest measures and is comparable to using a voting ensemble of all the methods (including Hist+RF). This result is surprising as we anticipated that the sequence learning methods would outperform a histogram representation of the data. We provide a further analysis of this result in Section 7.1.
The results are statistically significant using Cochran’s Q test, a non-parametric test for measuring the differences between three or more measurements. (The null hypothesis is that there is no difference between classifier outputs.) With the null hypothesis rejected, we test the significance between pairs of algorithms using the pairwise Cochran’s Q test, which is equivalent to McNemar’s test, and use the Šidák correction to control for the family-wise error rate.
Table 3: Pair-wise comparison of the investigated algorithms for statistical significance using the McNemar test with an alpha-value of 0.05 and Šidák correction.
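A minimal sketch of the pairwise procedure is given below, assuming the chi-square form of McNemar's test without continuity correction (the text does not say which variant was used). The disagreement counts and the number of comparisons (15, as would arise from six algorithms) are hypothetical, and the function names are illustrative.

```python
import math

def mcnemar_p_value(b, c):
    """McNemar's test on the discordant counts of two classifiers.

    b: samples classifier A got right and classifier B got wrong.
    c: samples classifier B got right and classifier A got wrong.
    Uses the chi-square approximation with one degree of freedom.
    """
    if b + c == 0:
        return 1.0
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))

def sidak_alpha(alpha, num_comparisons):
    """Per-comparison significance level under the Sidak correction."""
    return 1.0 - (1.0 - alpha) ** (1.0 / num_comparisons)

# Hypothetical disagreement counts for one pair of classifiers.
p_value = mcnemar_p_value(b=15, c=5)
corrected_alpha = sidak_alpha(alpha=0.05, num_comparisons=15)
significant = p_value < corrected_alpha
print(p_value, corrected_alpha, significant)
```

With many pairs, the corrected threshold is far stricter than 0.05, so a pair with a moderate raw p-value may no longer count as significantly different.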
Table 3 shows which pairs of algorithms are different with statistical significance. With an alpha value of 0.05, the ensemble is statistically significantly different from all of the other methods. Despite the Hist+RF achieving higher accuracy, it is only significantly different from the LSTM. Interestingly, the LSTM is significantly different from all of the other algorithms except the LSM. This may be due to the neurons in both algorithms maintaining temporal state. The CNN+LSTM is significantly different from the LSTM and LSM. This gives some insight into the power of the input representation from the convolutional layer in CNNs and CNN+LSTMs.
The success of CNNs on system call log classification was surprising given that there is no identifiable, quantitative relationship between neighboring system call IDs as there is with image pixels, for example. Nonetheless, repeated, local, discriminative patterns exist in the log files that the CNNs are able to learn. The poorer performance of networks consisting only of LSTM layers may be due to the embedding used and/or inadequate training data to learn the temporal patterns necessary for good classification. For high-performing object recognition in images, state-of-the-art models use data augmentation where there are instances of the same object with different aspects including lighting, scales, and orientations.
For sequence data, the problem can be compounded by a separation in time or the re-ordering of important sequence structures. It is possible that some types of malware and goodware did not have a rich enough representation in the training set for the LSTMs to adequately learn to recognize them. Data augmentation, which we did not do, is often a necessary element in creating a high-performance pattern recognizer. Data augmentation is difficult in general and can be especially difficult with cyber data since exact values can be important and should not be perturbed by random noise. An ideal situation would be to know the sequence patterns that discriminate between malware and goodware and to surround those patterns with various types of background data and separate them by random lengths of time.
Figure 2: The CAA over different lengths of the system call traces varying between 100 and 5000.
The overall performance of the CNNs hints at a potentially larger feature space than the dimensions of the selected kernel for the CNN. The best CNN performance resulted from a hybrid network with five layers, downsampling, and normalization between each layer. This strategy provides some amount of resistance to noise while allowing the CNN to detect features over longer sequences. This also seems to align with the performance of the Hist+RF.
The high performance of RFs was surprising. RFs are a simpler approach for classification and are easier to train and maintain than neural techniques. All approaches perform well, however, and could be used in a variety of domains based on specific domain needs.
6.1 System Call Length We examined how the sequence length affects the overall performance. The CAAs for each algorithm are shown in Figure 2. With only 100 system calls, the LSM has the highest CAA at 90%. The CAA for the LSM stays around 90% regardless of the sequence length. The CAAs for the other methods increase as the number of system calls increases up to a length of 1000. The fact that the LSM is able to achieve high CAA with only 100 system calls may be due to the signal averaging out as the number of system calls increases. It is also interesting that with only 100 system calls, malware can be identified with high accuracy.
Table 4: A comparison of results from different evaluations of each algorithm: a) sorted data set with roughly balanced goodware to malware ratio, b) 10-fold cross-validation, and c) a test set with significant class skew.
Including more than 1000 system calls does not provide a significant improvement in CAA. Thus, we can surmise that sufficient information is provided in the first portion of an executable for differentiating between goodware and malware.
6.2 Generalizing the Results to Real-World Scenarios Up to this point, the examined data set is temporally structured, but it contains a fairly balanced distribution of goodware to malware in the test set. While this allows for an evaluation of a broad spectrum of malware samples, it is not representative of the distribution of malware found in operational networks where the ratio of malware to goodware is much lower. To address this, we also created a data set with a skewed class distribution, as shown in the distributed column of Table 1 (a random down-select).
In typical operational situations, ML algorithms are first evaluated on a training set, which is often balanced (as done here) because class skew has been shown to exacerbate the effects of other characteristics causing misclassifications [21]. A common evaluation approach is n-fold cross-validation. However, this can provide an overly optimistic evaluation of the performance as temporal ordering is removed. Deploying to a different distribution than what was used for testing can result in significantly different performance.
Table 4 shows the results of evaluating the investigated algorithms using the sorted data set that we have been examining, 10-fold cross-validation, and a distributed data set with significant class skew in the test set as might be observed in an operational network. Generally, cross-validation achieves better metrics than the sorted data set. The cross validation results do not take concept drift into account in the performance metric as the temporal ordering is not preserved. This shows that concept drift is an important aspect to take into account when developing models for malware detection. Also, if using cross-validation, the same performance levels should not be expected in operational settings.
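The difference between the two evaluation schemes comes down to how samples are assigned to the training and test sets. The sketch below shows a temporally ordered split, in which every training sample precedes every test sample. The `first_seen` field, the 80/20 split fraction, and the sample records are hypothetical and are not taken from the data set.

```python
def temporal_split(samples, train_fraction=0.8, time_key="first_seen"):
    """Split samples so that all training samples precede all test samples.

    Assumes each sample is a dict with a sortable timestamp under `time_key`
    (a hypothetical field name); ISO date strings sort chronologically.
    """
    ordered = sorted(samples, key=lambda s: s[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Hypothetical samples collected over one year (label 1 = malware).
samples = [
    {"first_seen": "2012-03-01", "label": 1},
    {"first_seen": "2012-01-15", "label": 0},
    {"first_seen": "2012-11-30", "label": 1},
    {"first_seen": "2012-07-04", "label": 0},
    {"first_seen": "2012-05-20", "label": 1},
]
train, test = temporal_split(samples)
print(len(train), len(test))
```

Shuffled cross-validation would instead mix late samples into the training folds, which hides any drift between earlier and later malware behavior.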
For the distributed data set, the CAA is often similar to the cross-validation and the sorted data set, but precision and recall on the malware are significantly different: recall is close to 1 and precision is very low. The low precision is due to the fact that the ratio of malware to goodware is lower in the distributed scenario. In our case, we have 45 malware and 4728 goodware samples. If only 3% of the goodware is misclassified, then 142 samples are misclassified and, even if all 45 malware samples are detected, the malware precision is 45/(45 + 142), or 24%. For the sorted scenario with more malware samples, the malware precision changes to almost 95%. Thus, class skew alone has dramatic effects on the results despite achieving good results from cross-validation and using the sorted data set.
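The arithmetic behind the drop in precision can be checked with a short calculation. The 45 malware and 4728 goodware samples and the 3% goodware misclassification rate come from the text; perfect malware recall and the balanced comparison case of 1000 samples per class are assumptions made for illustration.

```python
def malware_precision(n_malware, n_goodware, recall, false_positive_rate):
    """Expected malware precision for a given class mix and error rates."""
    true_positives = recall * n_malware
    false_positives = false_positive_rate * n_goodware
    return true_positives / (true_positives + false_positives)

# Skewed test set from the text, assuming every malware sample is detected.
skewed = malware_precision(45, 4728, recall=1.0, false_positive_rate=0.03)

# Hypothetical balanced test set with the same error rates.
balanced = malware_precision(1000, 1000, recall=1.0, false_positive_rate=0.03)
print(round(skewed, 3), round(balanced, 3))
```

The classifier is identical in both cases; only the class mix changes, which is why precision measured on balanced data should not be expected to carry over to an operational network.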
7.1 Feature Representation Our results have shown that ML is a viable option for identifying malware from system call traces. We had hypothesized that the more sophisticated sequence learning methods would outperform the Hist+RF. To further investigate why the Hist+RF performs as well as it does, we test whether using a RF and/or representing the data with histograms has a significant effect on the results. While the Hist+RF was not statistically significantly different from the neural algorithms (other than the LSTM), it has a much lower training complexity than the neural methods.
To investigate if the RF caused the high accuracy, we took the output from the neural methods before they are fed into the last layer and used them as input to a RF. This tests whether the RF provides more discriminatory power than the linear classifier found at the last layer in a neural network. It also provides insight into whether the deep learning methods are able to automatically extract higher-order features from the system call traces as has been shown in other domains.
The results for using the outputs that feed the last layer of the neural methods as input to a RF are shown in Table 5. The RF column represents using the output from the neural models as the features input to a RF. The Non-RF column refers to using the original classifier of the algorithm (linear classifier). The results show that the CAA decreases when a RF is used as the classifier, although only by about 1%.
Table 5: The CAA for a RF trained on the outputs before the output layer for neural methods. The right column gives the original CAA for the algorithm.
We also test whether the histograms were able to achieve better results using a linear classifier. Using the histograms as input to a linear classifier results in a decrease in CAA of about 10 percentage points, from 95.3% to 85.0%.
While the system call sequences describe the behavior of an executable, there are several methods for achieving the same functionality in an executable. Malware authors take full advantage of that fact to obfuscate the behavior of their code, including adding spurious system calls and changing the order of the system calls. This makes it difficult to learn from the sequence of system calls. The space of system call sequences is very large, yet the valid sequences of system calls (sequences that execute a valid process) are sparsely distributed throughout the space, especially those that implement malicious behavior. The space is not continuous, making it difficult for gradient-descent-based optimization methods to generalize. Given more data with greater variety, the sequence learners may have been able to produce better results. These constraints may have played into why Hist+RF performed as well as it did.
7.2 Characterizing Malware In addition to identifying malware, it is important to also understand what the malware is doing and why it is classified as malware. Several techniques, such as feature importance and explainability methods, can help explain what the model has learned.
The Gini importance value measures the importance of each feature from an induced RF [5]. We use the implementation provided in the Python package scikit-learn [15]. The 20 most important features (system calls) using the RF with histograms are shown in Table 6. The Goodware and Malware columns indicate whether the system call was also one of the 20 most frequently called system calls relative to the other class.
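Gini importance is built from the impurity decrease produced by each split on a feature, accumulated over the trees of the forest. The sketch below computes that decrease for a single split; the class counts are hypothetical, and the helper names are illustrative rather than part of scikit-learn, which exposes the aggregated, normalized values through the `feature_importances_` attribute of a fitted forest.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node given the number of samples in each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def impurity_decrease(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into two children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini_impurity(left) \
        + (sum(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Hypothetical split on one system call count: [malware, goodware] per node.
parent = [50, 50]
left = [45, 5]    # samples where the call count exceeds some threshold
right = [5, 45]   # remaining samples
decrease = impurity_decrease(parent, left, right)
print(round(decrease, 3))
```

A system call whose splits repeatedly yield large decreases across many trees ends up near the top of a ranking such as the one in Table 6.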
Table 6: The most important features using the Gini importance values in Hist+RF. The Goodware and Malware columns represent whether a system call was more likely to be called by goodware or malware.
The most discriminating system calls deal with file IO and virtual memory allocation. Malware issues NtFsControlFile system calls more frequently than goodware. This system call “sends a control code directly to a specified file system or file system filter driver, causing the corresponding driver to perform the specified action” [1]. Thus, we can see that, in general, malware may try to be more specific with which resources it is using. Incorporating the parameters sent with system calls would provide more details about its behavior.
The Gini feature importance measure provides a global overview of the characteristics of malware and goodware. We also examine the important features for each individual prediction using Local Interpretable Model-Agnostic Explanations (LIME) [18]. The top 15 averaged feature importance from LIME for correctly identified malware and misclassified malware are shown in Figure 3a and 3b, respectively. The green bars to the left are feature importance values for malware and the red bars to the right are feature importance values for goodware. The standard deviation error bars indicate how much variance there is when calculating the mean. Several of the features overlap with those found by analyzing global feature importance.
The LIME explanations for the first three features are the same for the correctly classified and misclassified malware samples with little variance. The explanations diverge starting with the fourth and fifth features, again, with little variance for correctly classified and misclassified samples. This is important to note because the fourth and fifth features for the misclassified malware indicate that the features change the classification to goodware. Malware authors could use this information to better obfuscate their malware while those protecting networks can use this information to improve their models. This information could also be used for remediation purposes. For a given sample, knowing what the malware was doing that led to it being classified as malware can provide a starting point for a forensic investigation.
Figure 3: The most important features on average with standard deviation error bars using LIME for a) correctly classified malware samples and b) misclassified malware samples. The length of the bar represents how influential it is for goodware or malware. Green bars to the left are for malware and red bars to the right are for goodware.
Figure 4: Example rule from an induced decision tree.
Decision trees (DTs) can be used to create a set of rules to better understand the structure in the data. We train a DT on output from the RF and examine the splits that are induced by the DT. Rules, such as the rule shown in Figure 4, could be extracted from the DT and used to supplement other IDSs such as SNORT.
Examining the DT, the first split is on NtSetInformationFile and the majority of instances with that one split are malware (2666:176 malware to goodware). Following the tree down further provides finer granularity. Knowing which system calls or combinations of system calls have high discriminatory power can be very powerful for defending a system, identifying vulnerabilities, and mitigating their risk. Using ML models with a domain expert could help to harden computer systems against malware.
7.3 Algorithmic Considerations The discussion thus far has focused on the histogram representation of the system calls. Explaining sequences of inputs is not as well established. In addition, there are algorithmic constraints that should be considered when deciding which algorithms to use.
The neural methods inherently face a computational bottleneck with the vector-matrix multiply. The parameter space (number of weights) is extremely large, especially as the number of neurons increases. In addition, there is a large hyper-parameter space (e.g., architecture of the network, activation functions, momentum, and dropout) that has a significant impact on the induced model. Hist+RF is relatively simple algorithmically compared to the neural methods. The neural methods also require large amounts of data to learn an effective model. With only 10,000-13,000 training examples from a complex space, some of them with very few system calls (short sequences), the neural methods were not able to perform as well. Their performance could possibly be improved by using unlabeled data for pre-training.
In this paper, we examined several ML methods as part of a dynamic analysis of executables for detecting malware. Each proved to be a viable solution achieving over 90% CAA. We examined techniques for characterizing the malware using feature importance and explainability techniques. Extending these techniques to data sequences could help better characterize malware and how to mitigate its effects on a system. The results and techniques presented here can serve as a step to improving security against malware.
[1] ZwFsControlFile routine description. https://msdn.microsoft.com/en-us/library/windows/hardware/ff566462(v=vs.85).aspx. Accessed: 2017-10-05.
[2] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, in Proceedings of The 33rd International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger, eds., vol. 48 of Proceedings of Machine Learning Research, New York, New York, USA, 20–22 Jun 2016, PMLR, pp. 173–182.
[3] B. Anderson, D. Quist, J. Neil, C. Storlie, and T. Lane, Graph-based malware detection using dynamic analysis, Journal in Computer Virology, 7 (2011), pp. 247–258.
[4] L. Breiman, Random forests, Machine Learning, 45 (2001), pp. 5–32.
[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Wadsworth and Brooks, Monterey, CA, 1984.
[6] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, A survey on automated dynamic malware-analysis techniques and tools, ACM Comput. Surv., 44 (2008), pp. 6:1–6:42.
[7] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, 9 (1997), pp. 1735–1780.
[8] E. M. Hutchins, M. J. Cloppert, and R. M. Amin, Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains, Leading Issues in Information Warfare & Security Research, 1 (2011), p. 80.
[9] B. Kolosnjaji, A. Zarras, G. Webster, and C. Eckert, Deep Learning for Classification of Malware System Call Sequences, in Australasian Joint Conference on Artificial Intelligence, 2016, pp. 137–149.
[10] J. Z. Kolter and M. A. Maloof, Learning to detect and classify malicious executables in the wild, Journal of Machine Learning Research, 7 (2006), pp. 2721–2744.
[11] C. Kruegel, E. Kirda, and A. Moser, Limits of Static Analysis for Malware Detection, in Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC), 12 2007.
[12] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, in Proceedings of the IEEE, 1998, pp. 2278–2324.
[13] W. Maass, T. Natschläger, and H. Markram, Real-time computing without stable states: A new framework for neural computation based on perturbations, Neural Computation, 14 (2002), pp. 2531–2560.
[14] L. Nataraj, S. Karthikeyan, G. Jacob, and B. S. Manjunath, Malware images: Visualization and automatic classification, in Proceedings of the 8th International Symposium on Visualization for Cyber Security, 2011, pp. 4:1–4:7.
[15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research, 12 (2011), pp. 2825–2830.
[16] S. Ranveer and S. Hiray, Comparative analysis of feature extraction methods of malware detection, International Journal of Computer Applications, 120 (2015), pp. 1–7.
[17] C. Ravi and R. Manoharan, Malware detection using windows api sequence and machine learning, International Journal of Computer Applications, 43 (2012), pp. 12–16.
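[18] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.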
[19] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, Data mining methods for detection of new malicious executables, in Proceedings of the 2001 IEEE Symposium on Security and Privacy, Washington, DC, USA, 2001, IEEE Computer Society, pp. 38–49.
[20] M. R. Smith, A. Hill, K. D. Carlson, C. M. Vineyard, J. Donaldson, D. R. Follett, P. L. Follett, and et al., A novel digital neuromorphic architecture efficiently facilitating complex synaptic response functions applied to liquid state machines, in Proceedings of the IEEE International Joint Conference on Neural Networks, 2017, pp. 2421–2428.
[21] M. R. Smith, T. Martinez, and C. Giraud-Carrier, An instance level analysis of data complexity, Machine Learning, 95 (2014), pp. 225–256.
[22] S. Tobiyama, Y. Yamaguchi, H. Shimada, T. Ikuse, and T. Yagi, Malware detection with deep neural network using process behavior, in IEEE Annual Computer Software and Applications Conference, COMPSAC, 2016, pp. 577–582.
[23] I. You and K. Yim, Malware obfuscation techniques: A brief survey, in Proceedings of the 2010 International Conference on Broadband, Wireless Computing, Communication and Applications, Washington, DC, USA, 2010, IEEE Computer Society, pp. 297–300.