A New Burrows Wheeler Transform Markov Distance

2019·Arxiv

Abstract

Abstract

Prior work inspired by compression algorithms has described how the Burrows Wheeler Transform can be used to create a distance measure for bioinformatics problems. We describe issues with this approach that were not widely known, and introduce our new Burrows Wheeler Markov Distance (BWMD) as an alternative. The BWMD avoids the shortcomings of earlier efforts, and allows us to tackle problems in variable length DNA sequence clustering.BWMD is also more adaptable to other domains, which we demonstrate on malware classifica-tion tasks. Unlike other compression-based distance metrics known to us, BWMD works by embedding sequences into a fixed-length feature vector. This allows us to provide significantly improved clustering performance on larger malware corpora, a weakness of prior methods.

1 Introduction

Compression algorithms can be used to measure the similarity between arbitrary sequences with little required domain knowledge or expertise. They have been used in bioinformatics(Mantaci et al., 2008), time series classification and clustering(Keogh et al., 2004), and malware analysis (Borbely, 2015). The bioinformatics and malware analysis domains can be particularly attractive for compression-based similarity measures. Both of these domains involve “short” sequences of tens of thousands of steps, and can often reach length. Other machine learning techniques often fail to work when dealing with sequences of such variety and length.

In this work, we note that the Extended Burrows Wheeler Transform (EBWT) (Mantaci et al., 2005) is a compression-based distance metric designed explicitly around the Burrows Wheeler Transform (BWT) (Burrows and Wheeler, 1994) algorithm for use in bioinformatics. While EBWT has been useful in that domain, we have discovered a number of weaknesses in this method that reduce its effectiveness and prevent it from being useful in other domains, such as malware detection.

To remedy these issues, we develop a new BWT-inspired distance measure that we refer to as the Burrows Wheeler Markov Distance (BWMD). Unlike EBWT, BWMD is a valid1 distance metric, and can scale to far larger problems that EBWT cannot tackle due to computational limits. Compared to other compression-based distances, our BWMD is the first to work by embedding a sequence into a Euclidean vector space. This gives a significant advantage to our approach in terms of clustering and query speed. This advantage is achieved by using algorithms that are designed around Euclidean distance, like k-means, that other methods cannot leverage.

We will begin by reviewing related work in the compression distance space, and the needed details of the BWT, the prior method EBWT, and a related method known as LZJD, in § 2. Next we will begin with a description of the new BWMD in § 3. In § 4 we will develop a number of new theoretical insights, proving 1) how EBWT has dramatic failure cases that violate our intuition of how a distance measure should work, 2) that BWMD does not have these failure cases, and 3) comparing how EBWT, BWMD, and LZJD handle randomness, and 4) that BWMD has unique properties in this regard. We will then move into empirical results in § 5 by comparing BWMD with EBWT on DNA sequence clustering, where we show that BWMD is able to cluster DNA sequences of varying lengths that EBWT fails to cluster in a meaningful way. In § 6 we will show how BWMD is able to scale to malware classification and clustering tasks that are beyond EBWT’s computational ability. Though LZJD provides better classification accuracy at this task, BWMD provides superior clustering results. Finally we will conclude in § 7

2 Compression Distances

Compression in general can be seen as one way of performing many machine learning tasks, and has deep connections to statistical methods. Following this intuition, Li et al. (2004) introduced the Normalized Information Distance (NID) as a method of measuring similarity using compression. Given a function K(x) that computes the Kolmogorov complexity (i.e., return the length of the shortest computer program that produces x as output), and the associated conditional Kolmogorov complexity K(x|y) (i.e., the length of the shortest computer program that produces x as output given y as input), the NID is a metric as defined in (1). The Kolmogorov complexities are uncomputable functions, making NID of no practical use.

To remedy this situation, Li et al. (2004) went on to present the Normalized Compression Distance (NCD), which replaces the uncomputable K(x) with C(x), which returns the length of x in bytes after running a compression algorithm. To approximate K(x|y), the concatenation of x and y (denoted by ) is used, giving the final form of NCD in (2). The compression algorithm chosen impacts the quality of the results. The bzip and LZMA algorithms have been popular due to a combination of reasonable run time performance and generally satisfactory compression ratios.

Where C(x) is the (integer) length of object x when compressed. NCD, and compression-based distances in general, do not require significant feature engineering to be applied in practice. This has made them popular for genomic phylogeny (Li et al., 2004; Cilibrasi and Vitanyi, 2005) and malware analysis (Bayer et al., 2009; Apel et al., 2009; Bailey et al., 2007; Karim et al., 2005), where sequences are longer than what most other techniques can handle (steps in length), and it can be difficult to extract more sophisticated features manually. However, using compression algorithms naively in NCD leads to difficulties with computational scalability and reduced accuracy/failure cases due to the fact that compression algorithms were not designed for similarity analysis (Borbely, 2015; Cebrián et al., 2005). For these reasons, some have looked at converting known clustering algorithms into explicit distance measures. For example, the Lempel Ziv Jaccard Distance (LZJD) (Raff and Nicholas, 2017a) converts the LZMA algorithm into a true distance measure, and for malware classification tasks has superior accuracy and runtime compared to NCD2.

2.1 Extended Burrows Wheeler Transform

Table 1: BWTThe Burrows Wheeler Transform (BWT) (Burrows and Wheeler, 1994) is a core component of the bzip compression algorithm, and has been widely used in information retrieval applications due to its ability to accelerate search queries(Ferragina and Manzini, 2005). The BWT takes an input string u of length n = |u|, over an alphabet , and produces a new string . Through the use of an end-of-file (EoF) marker, the BWT is invert-

ible, so u can be recovered from without loss.

BWT’s utility in compression is best understood through example. Consider Table 1, where the BWT transform of the string “easypeasy” is shown. BWT adds a special EoF marker “$”, and lexicographically sorts every single-character rotation of the string (observe column F, which is in sorted order). The BWT output is then the last column of each string, shown in column L. By computing the BWT version of the string, we can see how runs of the same character (“ee”, “aa”, “ss”) have been created that previously did not exist. Simple run-length encoding can then be applied to produce a compressed version of the string.

These sorted rotations can be computed in O(n) time, and provide a simple method of compression. To turn this into a distance measure, Mantaci et al. (2005) developed the Extended Burrows Wheeler Transform (EBWT). The EBWT works by defining the BWT over a pair of inputs u and v, and computing the sorted order of both sequences. An example of this is shown in Table 2.

Table 2: EBWT computation example

The distance between the two sequences u and v is defined by (3), where rep(i) returns how many times the i’th source occurred in a row, and that only t source transitions occurred.

Again, this concept is easier understood through example. Considering Table 2, we can see that the source string sequence is uuvvuvuvv. If we group this by transitions, we get . Thus the distance ebwt(u, v) = 1 + 1 + 0 + 0 + 0 + 1 = 3.

Mantaci et al. (2005) developed the EBWT for applications in bioinformatics, and developed theory to show a number of situations under which the EBWT will perform well or have desirable properties. However, it is still expensive to compute, requiring O(|u| + |v|) time for every distance computation. This makes EBWT less attractive as bioinformatic sequences become longer, and reduces its utility in other domains in which compression distances have found use, such as malware classification. While it was known that EBWT did not satisfy the triangle inequality, preventing it from being a true distance metric, previously unreported theoretical issues also exist. We will discuss these issues in § 4.

3 Burrows Wheeler Markov Distance

Inspired by the prior work we have just discussed, we now develop a new distance measure based on the Burrows Wheeler Transform. We will refer to our method as the Burrows Wheeler Markov Distance (BWMD), and it is simple to implement. To begin, consider again Table 1, where BWT(“easypeasy”) is shown. The BWT’s effectiveness as a compression algorithm comes explicitly from its ability to re-order the content such that repetitions are reduced to a first-order occurrence. This is why run-length encoding is effective. Because first-order compression is independently effective after the BWT, we do not need to consider more complex interactions over the extent of a file.

We also do not care about the invertibility of any transform. Our goal is to define a new feature space where we can perform effective machine learning and information retrieval. So we seek to build a small statistical summary of the data, rather than build an object from which we can recreate the original data. By measuring the similarity of these small summaries, we measure the similarity of the underlying sequences. Given that first-order compression is effective with the BWT, we chose to select a first-order statistical model. In particular, we can use a Markov model of the probability of observing token given the previous token . The tran- sition matrix can then be used as a statistical summary of the entire sequence BWT(u).

This is all that is needed to describe BWMD, and a succinct description is given below. Each step takes O(n) time for an input sequence of n bytes. 1[z] is the indicator function, which returns 1 if z is true, and 0 otherwise, and rows and columns of the transition matrix. 1. For each sequence u in a corpus of size N 2. Compute 3. Estimate flattened Markov transition vector

4. Normalize

After step (4) in the above process, we obtain from the input u a single feature vector x which we might use, in place of u, in machine learning or information retrieval algorithms. Regardless of the length of the input sequence u, the size of the vector x will depend only on the alphabet size working on raw bytes, this would be a feature vector. While such a vector takes up to 256 KB per sequence u, the individual input data objects we consider in this work range from 1 MB in size up to 400 MB. This makes the BWMD description quite compact by comparison. When u is shorter in length, the vector x can be stored in a sparse form, making the memory cost

The normalization in step (4) is also chosen intentionally. The Hellinger Distance is a metric over probability distributions. For the case of two discrete probability distributions and , the Hellinger Distance is defined as follows:

Due to the form of (4), the Hellinger distance corresponds to the Euclidean distance between two transformed and scaled versions of P and Q. By using the square root of the coeffi-cients in step (4), divided by, we have a feature vector x that has been placed into a space where the Euclidean distance can be computed as usual. The results are then equivalent to the Hellinger distance between Markov transition vectors, giving us a statistically sound interpretation for BWMD. That the BWMD is a valid distance metric also follows immediately from the use of the Euclidean distance, which is well known to be a valid metric, a property that the EBWT lacks.

We explicitly use the Hellinger distance over other alternatives, like KL-Divergence, because it corresponds exactly to the Euclidean distance after transformation. This makes is a valid distance metric for which we can make mathematical statements about its behavior with little work, and prove it does not have the same shortcomings as prior methods in this space. In addition, most other clustering and fast retrieval algorithms are built around the Euclidean distance, making our method compatible with a maximal number of complimentary techniques. This is not the case for any prior compression based measure. The ability to use algorithms like k-means, where other alternatives cannot, provides BWMD with advantages in terms of clustering accuracy, as well as computational efficiency to handle big data.

Note that because of the BWT, our approach is not equivalent to 2-grams. Our comparison with LZJD, which can be interpreted as adaptive variable-length gram(Raff and Nicholas, 2017b), and BitShred, which uses 16-grams, will show that our method is meaningfully more effective than simple ngram approaches.

4 Theoretical Results

We begin by developing a stronger theoretical understanding of our new method, as well as the prior approach EBWT. Prior works have looked at a number of properties of the EBWT(Mantaci et al., 2007, 2008, 2005), and describe situations in which EBWT will behave as a metric for a subset of possible inputs, and that it is invertible like the standard BWT. However, our interest in BWT is as a general purpose similarity measure for information retrieval and machine learning applications. We begin by showing three undesirable properties of the EBWT that reduce our confidence in its use for such applications. Then we will investigate the nature of our new BWMD in these same cases where EBWT might fail.

4.1 EBWT Shortcomings

First we show a simple property that is a direct result of the EBWT measuring distance as a function of repeated source sequences. When we have two strings u and v in any alphabet , it is necessarily the case that the distance is bounded below by the difference in sequence lengths |u| and |v|. If u and v differ significantly, we are unlikely to be able to make meaningful similarity comparisons.

Theorem 1. The distance

Proof. Consider any two strings u and v. The minimum distance involves the maximum number of transitions between string sources. If |u| < |v|, that means there can be at most transitions, going back and forth between u and v on the merging. That necessitates repetitions at the end. Given the definition of ebwt in (3), that means the minimum distance between

The above result leads us to suspect that EBWT will not be useful when the sequences being compared are of varying lengths. The greater the difference in sequence length, the more troubling this issue might become. Given the insight from Theorem 1, we move on to a more serious departure from our intuition of how a distance measure should behave. In particular, if u is a subset of v, we should expect the distance between u and v to be small. Instead, it is possible for EBWT to return the maximal distance under this scenario.

Theorem 2. It is possible for maximum possible distance, even if

Proof. Consider the string and , such that and . Because of the topographical sorting, all rotations of u and v will have the same characters, and so the sorting will only resolve once the max substring length is reached. Since all rotations in u are of length , which is shorter than , a sorting will place all rotations of u before any rotations of v. This results in a transition pattern of , and thus

Theorem 2 defies our expectations in the case of similar inputs. If we use the behavior of the NID from (1), we would expect the distance in this scenario to be small. Consider the proof’s example with and : we would expect NID(. This is because v could be encoded as the sequence u repeated times, which in the worst case, can be represented in a number of bits logarithmic in the value being encoded (i.e., the nature of any big-integer representation is that the maximum value that can be represented grows exponentially with a doubling of the bits).

We now show that EBWT likewise surprises us in the case of dissimilar inputs, where we can have u and v with no overlap in content, but EBWT identifies as having maximal similarity.

Theorem 3. It is possible for ebwt(u, v) = 0, even if there exists no shared characters between u and v. More formally, there exists u, v and an alphabet such that for every

Proof. Let denote the concatenation of a and b, and the sequential concatenation of 1, 2, . . . , N. Without loss of generality, define the string and v = . Thus |u| = |v|, but in the lexicographical sorting u and v will alternate between rotations u[0] = 2, . Thus ebwt(u, v) will always contain transitions with no source repetition, hence ebwt(u, v) = 0.

The reader may note that the construction of Theorem 3 would allow one to argue that the scenario should return a small distance under the ideal Kolmogorov distance with NID. This could be argued because the construction of u and v allows v to be represented as v = u + 1. However, the same scenario can occur with randomized strings where the alphabet does not increment with any simple pattern and is filled with random tokens, so long as there is no overlap in the tokens and the tokens “balance out” once sorted (i.e., there is no This requires further expanding the alphabet size such makes Theorem 3 the least practical of our concerns. However, we find it enlightening as a theoretical shortcoming which we would prefer to avoid.

While these issues give us cause for concern, EBWT has found use in practice. We will show in § 5 that our concern for EBWT’s utility is more justified when sequences have varying length.

4.2 Behaviors of BWMD

To delve into BWMD’s behavior, we will begin by analyzing the same scenarios used to show our theoretical concerns with the EBWT in the preceding section, as well as compare the behavior of BWMD to that of the LZJD algorithm.

Corollary 1. Given and , such that

Proof. Under the construction of the embedding of since there will be only one transition pattern of . As such, the value at that index . Since the embedding will have the same construction, and thus, , and . Therefore, the distance between u and v will be zero.

The above also does not conform to expectation with respect to the NID, because we are ignoring the length of the inputs in our computation of distances. The NID(u,v) would be greater than zero in this scenario and is necessitated by storing the difference in repetition lengths and . This tells us BWMD will be less sensitive to differences in sequence length, which may be desirable or not, depending on the application.

The behavior of LZJD in this scenario was used to prove its sensitivity to potential repetition of the input. It was shown that LZJDas a lower bound for a similar scenario (Raff and Nicholas, 2017a). This distance grows at a rate considerably faster than logarithmic, but is also better than the EBWT distance in this case. We conclude that both LZJD and BWMD have better behavior, but BWMD will lower-bound the NID and LZJD will upper-bound the NID.

In a similar manner as Corollary 1 was shown, the same construction can be applied to Theorem 3’s issue for BWMD.

Corollary 2. For all u, v such that for every , then bwmd(u, v) = 1, the maximum possible distance.

The derivation follows from the fact that . This means the distance computation will reduce to . Therefore, the distance when the embeddings have no intersection is maximized. In this case, BWMD aligns well with the behavior we would expect from the NID. Likewise it is easy to see that LZJD will return its maximum distance of 1 in this scenario as well. LZJD measures the set intersection, so when the sets have no intersection, then maximal distance is achieved.

5 Genomic Clustering

We begin by showing that our new BWMD has similar utility as the original EBWT distance for genomic phylogeny from DNA sequences. This was the original proposed use of the EBWT measure, where they evaluated the SingleLink Clustering results on mitochondrial DNA (mtDNA) (Mantaci et al., 2005). Such data can be obtained using the NIH GeneBank , which we have used to create a similar corpus of DNA sequences to compare the relative pros and cons of BWMD and EBWT. We will use both mtDNA as has been done in prior work, but also a more challenging case with chromosomal genomic scaffold DNA sequences. We will produce dendrograms for each tasks with Single-Link Clustering (SLINK).

Figure 1: Single-Link Clustering result on mitochondrial mtDNA data.

First, we will evaluate mtDNA data on 28 different species, and use Single-Link Clustering to produce a dendrogram of the species based on their mtDNA. The results are shown in Figure 1, with both EBWT and BWMD taking under 1 second to perform the clustering. Our goal is not to fully evaluate the quality of each dendrogram, but to show that both methods produce reasonable results in this case, which may be of interest to researchers in bioinformatics. Both EBWT and BWMD do reasonably well at this task, with differing mistakes, advantages, and disadvantages.

EBWT gets most base level groups correct (e.g., lion, tiger, cat in one group, primates grouped together). There is a failure to properly group the harbor and gray seals as related to each other, and instead act as outliers which SLINK is forced into a cluster at the end at higher cost. EBWT also fails to group the white rhino with other members of the Ferungulates family (e.g., the horse and zebra would have been closest members) (Cao et al., 1998). BWMD was able to correctly pair the seals and placed white rhino with a larger family of Ferungulates (closest to horse, which is correct, and with the cows and yak which are members). But BWMD failed to place the mouse and rat together, and dispersed the zebra and lion from their more appropriate neighbors.

Results with both methods are reasonable. However, the mtDNA task is an easier task. All of the mtDNA sequences are of similar sizes, with the western grey kangaroo being shortest at 15 KB in length and the lion being longest at 17 KB. Our theoretical results in § 4 would indicate that we may see more significant issues if we had sequences of varying length. We explore this with a dataset of chromosome genomic scaffold DNA for a subset of the species evaluated. We selected one whole scaffold DNA sequence from a random chromosome for the 11 species where this was available. We selected “unplaced genomic scaffold” sequences for 3 remaining species (the yaks and tiger), which is a much shorter and incomplete amount of data. This gives us a minimum sequence size of 22 KB and a maximum of 33 MB.

Figure 2: Single-Link Clustering on Genomic Scaffolding.

The SLINK results are shown in Figure 2. At a base level, the cost to perform EBWT and its scalability issues are more pronounced. BWMD takes only 47 seconds to perform SLINK clustering. EBWT took 28 minutes, making it over 35slower. When plotting the EBWT dendrogram in Figure 2b, we include the size of the DNA sequence in parentheses. When organized in this way, it becomes clear that the EBWT clustering is degenerate, and corresponds exactly to file-size, rather than content.

In contrast, the BWMD in Figure 2a produces reasonable groupings despite disparate sequence lengths. For example, (full scaffold) cat and (unplaced and incomplete) tiger are correctly grouped despite the cat sequence being 450longer. The BWMD results are not perfect: the orangutan and domestic yak were placed farther from the other (well-grouped) primates and ferungulates respectively. Overall we can see that groups which should be placed near each other are, without degrading to sequence length information.

We include LZJD ( Figure 2c )in this scenario, to demonstrate that BWMD has increased value over other alternatives. Here we can see that LZJD suffers the same failure as EBWT in placing the three smallest partial segments into a single cluster. With the exception of correctly grouping the two chimpanzee species, the LZJD dendrogram is degenerate.

These results are in line with our theory derived in § 4. When working with sequences of homogeneous length, EBWT performed well. But BWMD is able to handle disparate sequence lengths reasonably well, where EBWT degrades to grouping by sequence length rather than content.

BWMD’s ability to handle the original mtDNA data, as well as substantially better results with the irregular-sized scaffold DNA, is made more impressive by the fact that BWMD is encoding everything into due to the small alphabet . This is a reduction in storage cost by a factor of up to 515.6, and allows for more flexibility in creating a larger and searchable index using BWMD.

5.1 A note on BMWD’s Disadvantage

It is also worth noting that, from an information encoding perspective, BWMD is at a disadvantage in this testing over DNA data. EBWT is dimensionless, and has the representation capacity of the merged sorting of two different strings, meaning its representational capacity is a function of the sequence length under consideration. BWMD encodes each sequence into a fixed-length feature vector of size . Since we are working with DNA data, the alphabet is quite small. As such all DNA sequences in these experiments, up to 33 MB in size, are being embedded into a 16-dimensional space. BWMD’s ability to match or outperform EBWT means we must be doing a significantly better job at leveraging the first-order information expressed by the Burrows Wheeler Transform.

6 Malware Results

It can be difficult to reliably parse malicious software as malware authors may intentionally violate rules and format standards. Compression based similarity measures are useful in this case as they allow us to avoid these complex parsing issues. In this section we will look at the classification accuracy of BWMD and LZJD for several datasets. Our expectation is that LZJD will have better classification accuracy, due to Lempel-Ziv compressors being more effective than those based on Burrows Wheeler. However, we find that BWMD has significant advantages when clustering is the goal accuracy by up to 65.6, and is 24faster to search large corpora by obtaining sub-linear scaling with no loss in accuracy. Since LZJD cannot achieve this, this advantage will only increase with corpus size.

Many prior works have looked at malware classification and clustering by processing raw bytes, due to the difficulty of parsing malware. We will compare with the seminal BitShred algorithm (Jang et al., 2011) for clustering. Other similar byte based distance functions such as Ssdeep and Sdhash where evaluated, but founds to have degenerate performance on our smallest and easiest corpora, which can be found in appendix § 6.1. Compression distances such as EBWT and NCD are not evaluated in this section, because it would require multiple compute months for our smallest dataset, and simply cannot scale to the size of our malware corpora.

Table 3: Malware datasets used in experiments.

For our evaluation, we will use several datasets summarized in Table 3. The EMBER dataset (Anderson and Roth, 2018) pertains to a binary classification problem of “benign vs malicious” for Windows executables. Because there are only two classes, clustering results will not be evaluated on this corpus. However it is by far the largest corpus, allowing us to explore the scalability of our algorithms. The raw files can be obtained from VirusTotal (www.virustotal.com) and are nearly 1TB total. Our remaining datasets will be multiclass problems where each sample is a member of a specific malware family. We will use these to evaluate both classifica-tion accuracy, as well as accuracy in clustering with respect to the class labels. Using VirusShare (Roberts, 2011) we create another Windows based dataset with 20 malware families. The families were determined using VirusTotal and the AVClass tool which determines a single canonical malware family label based on multiple Anti-Virus outputs (Sebastián et al., 2016). We select the 20 most populous families, and use 7,000 examples for training and 3,000 for testing. The last four datasets we use are evaluated in “two forms”, following prior work (Raff and Nicholas, 2017a). The Kaggle datasets are from a 2015 Kaggle competition sponsored by Microsoft (Ronen et al., 2018). In the “Bytes” version our algorithms are run on the raw malware binary, and in the “ASM” version the output of IDA-Pro’s disassembler is used instead. From the Drebin corpus (Arp et al., 2014) we use the 20 most populous families, where the “APK” version is the raw bytes of the Android APK (essentially a Zip file with light compression), and the “TAR” version which unpacks the APK and recombines all content into a single tar file.

6.1 Malware Classification

We begin our analysis by looking at nearest neighbor classi-fication performance of various methods. The performance of each algorithm under this scenario gives us insight not only to its utility, but how effective it would be for analysts in finding similar malware. Utility in this scenario requires both high accuracy and computational efficiency, as malware corpora are often measured in the terabyte to petabyte range.

Small Scale Malware Classification The Kaggle and Drebin corpora are considerably smaller in size, allowing us to test a wider selection of methods against them. In the below table we use balanced accuracy, where the weights of each file are adjusted so that the total weight of each class is equal, because malware families are not evenly distributed in each corpus.

Table 4: Balanced Accuracy results for 1-NN classification on each dataset. Results show mean 10-Fold Cross Validation accuracy (standard deviation in parentheses). Best results in bold, second best in italics.

We can see from these results that the compression-based approaches, LZJD and BWMD, generally outperform other alternatives by a wide margin. As was expected, LZJD has higher accuracy that BWMD, since LZJD is based on a more effective compression algorithm. While this is a slight weakness of BWMD, its advantage comes in being orders of magnitude faster, as we will show in the large scale testing in the next section. This makes it the only method usable for largerscale corpora. This also shows that Ssdeep and Sdhash are simply not accurate enough to be considered for use, without regard to computational constraints.

BWMD performed second best on every dataset, the only exception being a 3 point difference to the BitShred algorithm. However, BWMD outperformed BitShred by at least 11 points on all other datasets. Supporting our theoretical analysis in § 4, we also see hints that BWMD is better equipped to work with extremely long sequences. Most notably, BWMD is the only method which had improved accuracy and reduced variance when moving from Kaggle Bytes (4.67 MB) to Kaggle ASM (13.5 MB). This suggests that the disassembly may be in a form that allows the BWT to better capture first-order dependencies for compression.

The fact that BWMD has non-trivial accuracy on Drebin APK (random guessing is 5%) is particularly impressive and worth noting. This is because the APK files are essentially Zip files with a standard structure, and the Zip compression format is a more effective one than most BWT based methods such as bzip. As such, that there is any first-order information exploitable for effective similarity search is impressive, and indicates the utility of BWMD in wider applications.

Large Scale Malware Classification On EMBER we use 9-Nearest Neighbors as our classifier so that we can compute meaningful values for the Area Under the ROC Curve (AUC) (Bradley, 1997). In this malware detection context, AUC can be interpreted as an algorithm’s ability to properly prioritize all malicious software above all benign software. This metric is useful for prioritizing work queues, and is therefore particularly pertinent. We evaluate only BWMD and LZJD due to computational constraints on this larger dataset.

BWMD obtains an AUC of 98.3%, where LZJD acheives a slightly better 99.7% AUC. As expected, LZJD obtains a higher accuracy than BWMD, because LZJD is built upon a more effective compression algorithm. However, accuracy in isolation does not determine what method is best to use. Due to the large size of malware corpora, sub-linear scaling is needed to be useful for realistic sized datasets.

BWMD does have an advantage over LZJD in its ability to scale to large corpora in an efficient manner. This is critical, since industry datasets routinely require comparisons to terabytes of files or more (Roussev, 2010). Prior work has tried, with limited success, to scale LZJD to larger corpora. Using an extension of the Vantage Point (VP) tree (Yianilos, 1993), only a 2.5x speedup over brute force search was achieved(Raff and Nicholas, 2018a). Because BWMD operates by embedding files into a Euclidean space, we can leverage specialized algorithms like the Dynamic Continuous Index (DCI) algorithm (Li and Malik, 2017) that only work for the Euclidean distance. DCI works by projecting the whole dataset down to different, random, embeddings, and allows obtaining the true nearest neighbors in a fast and effi-cient manner. LZJD is not compatible with such algorithms, resulting in BWMD being better equipped for this task.

Figure 3: Table shows 9-NN search retrieval speed on the Ember test set (in milliseconds, y-axis, log-scale) as the number of training points (x-axis, log-scale) increases.

In Figure 3 we compare the total query time of BWMD and LZJD under different indices, as the training set size increases from 64 files up to the full 600,000. We found the VP tree of minimum variance (Raff and Nicholas, 2018a) performed best compared to other algorithms like KD and Cover-trees, and so only its results are included. In the dashed line we show BWMD accelerated with the Dynamic Continuous Index (DCI) algorithm (Li and Malik, 2017).

We can see that as the training corpus becomes larger, the

VP trees are able to get small constant factor speedups, but are not able to reliably prune large portions of the search space. Because BWMD is in Euclidean space, it is the only method able to leverage the DCI algorithm and thus able to get significant order-of-magnitude search speedups. This combination makes BWMD 24faster than LZJD (5.6 CPU hours compared to 5.6 days), and 834faster than BWMD with a brute force search. One can clearly see that DCI’s scaling is sub-linear, and its advantage grows with the corpus size. This is obtained with no loss in accuracy on the Ember corpus, making BWMD the only effective approach for scaling to even larger corpora.

6.2 Malware Clustering

In this section we will show that BWMD has significant advantages in terms of clustering malware into families. This benefit comes largely from BWMD mapping sequences into a Euclidean feature space, where we can leverage tried-and-true algorithms like k-means to perform fast and useful clustering. LZJD is incompatible with k-means, and similar methods that require an explicit euclidean feature vector. As such LZJD, like BitShred, is constrained to distance based clustering methods like agglomerative clustering. This puts them at a significant disadvantage compared to BWMD.

Evaluating the quality of our clustering results, we will consider three measures: Homogeneity, Completeness, and VMeasure, as introduced by Rosenberg and Hirschberg (2007) and using the class labels as ground truth cluster assignments. Homogeneity measures how well an algorithm does at making each found cluster as “pure” as possible (i.e., only one class in each cluster). Completeness measures how well an algorithm groups all examples of a class into as few clusters as possible (i.e., all examples of one class in only one cluster). V-Measure is the harmonic average of Homogeneity and Completeness. All three metrics are measured on the scale [0, 1], with 0 being worst, and 1 being the maximum score.

In performing the clustering, we will test using k = the true number of classes and the true number of classes. The former (k = C) is done to judge how well the clustering algorithms are able to recover the underlying ground truth. The latter () is done as it corresponds best to how a malware analyst would desire to use these tools.It is easier to over-estimate the number of clusters than to predict the exact value of k, and by clustering an analyst would hope to reduce their workload by quickly checking that files in the same cluster are related, and then performing an in-depth analysis on only a few representatives from each cluster (VanHoudnos et al., 2017). For this reason, we consider Homogeneity the most important of the three measures, as it corresponds with how an analyst would use clustering, followed by V-Measure, and then Completeness.

BWMD is the only method that can leverage the kMeans algorithm, and we use Hamerly’s variant because it avoids redundant computation while returning the exact same results(Hamerly, 2010). For LZJD and BitShred we use Average-Link clustering using a fast algorithm (Müllner, 2011). While the original BitShred paper used Single-Link, we found Average link provided the best results across all metrics for both BitShred and LZJD. The results

Table 5: Clustering performance of BWMD, LZJD, and BitShred. Best results shown in bold.

are shown in Table 5, where we can see BWMD dominates LZJD and BitShred by our most important metrics, Homogeneity and V-Measure3. BWMD’s advantage in this regard is often dramatic. For example, BWMD scores on Homogeneity compared to LZJD when Kaggle bytes dataset, and better than BitShred. While BWMD does not always perform best by the Completeness metric, it is always competitive with the best scoring method, which is why BWMD dominates by V-Measure . The results overall clearly indicate that BWMD provides the best clusterings across multiple datasets, of different encodings, and different numbers of clusters, showing the flexibility of the compression-based approach.

Because BWMD can leverage the k-means concept and the many efficient algorithms for its computation, it is also the most scalable for these methods. LZJD and BitShred are inherently limited by the lower-bound complexity of hierarchical clustering 4 . For example, BWMD took only 27 minutes to over-cluster the 160,000 files in the VS20F training set, the largest under consideration. This is faster than LZJD which took 7.76 hours, and faster than Bitshred at just over a day.

7 Conclusion

We have developed and introduced the Burrows Wheeler Markov Distance (BWMD), a new distance metric inspired by the Burrow Wheeler Transform. A theoretical analysis has shown several ways in which BWMD has better behavior, which is confirmed by showing new abilities for clustering DNA sequences that prior methods could not handle. For malware clustering, we have shown BWMD considerably outperforms prior methods in both speed and accuracy, and BWMD is the only byte-based method which can achieve sub-linear search scaling on larger corpora.

References

H. S. Anderson and P. Roth. EMBER: An Open Dataset for Training Static PE Malware Machine Learning Mod-

3Not included due to space limitations, BWMD also dominates

by Normalized Mutual Information and Adjusted Rand Index. 4Other alternatives, like k-medoids, have similar

complexity.

els. ArXiv e-prints, 2018. URL http://arxiv.org/abs/1804. 04637.

M. Apel, C. Bockermann, and M. Meier. Measuring similarity of malware behavior. In 2009 IEEE 34th Conference on Local Computer Networks, number October, pages 891–898. IEEE, 10 2009. ISBN 978-1-4244-4488-5. doi: 10.1109/LCN.2009.5355037. URL http://ieeexplore.ieee. org/document/5355037/.

D. Arp, M. Spreitzenbarth, H. Malte, H. Gascon, and K. Rieck. Drebin: Effective and Explainable Detection of Android Malware in Your Pocket. Symposium on Network and Distributed System Security (NDSS), (February): 23–26, 2014. doi: 10.14722/ndss.2014.23247.

M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. Automated Classification and Analysis of Internet Malware. In Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection, RAID’07, pages 178–197, Berlin, Heidelberg, 2007. Springer-Verlag. ISBN 3-540-74319-7, 978-3-540-74319-4. URL http://dl.acm.org/citation.cfm?id=1776434. 1776449.

U. Bayer, P. M. Comparetti, C. Hlauschek, C. Kruegel, and E. Kirda. Scalable , Behavior-Based Malware Clustering. NDSS, 9, 2009.

K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When Is ”Nearest Neighbor” Meaningful? In Proceedings of the 7th International Conference on Database Theory, ICDT ’99, pages 217–235, London, UK, UK, 1999. Springer-Verlag. ISBN 3-540-65452-6. URL http: //dl.acm.org/citation.cfm?id=645503.656271.

R. S. Borbely. On normalized compression distance and large malware. Journal of Computer Virology and Hacking Techniques, pages 1–8, 2015. ISSN 2263-8733. doi: 10. 1007/s11416-015-0260-0.

A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997. ISSN 00313203. doi: 10.1016/S0031-3203(96)00142-2.

M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithm. Technical report, DEC Systems Research Center, Palo Alto, Califorina, 1994. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi= 10.1.1.141.5254.

Y. Cao, A. Janke, P. J. Waddell, M. Westerman, O. Takenaka, S. Murata, N. Okada, S. Pääbo, and M. Hasegawa. Conflict Among Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders. Journal of Molecular Evolution, 47(3):307–322, 1998. ISSN 1432-1432. doi: 10.1007/PL00006389. URL https://doi.org/10. 1007/PL00006389.

M. Cebrián, M. Alfonseca, A. Ortega, and others. Common pitfalls using the normalized compression distance: What to watch out for in a compressor. Communications in Information & Systems, 5(4):367–384, 2005.

R. Cilibrasi and P. M. B. Vitanyi. Clustering by Compression. IEEE Transactions on Information Theory, 51(4):1523– 1545, 4 2005. ISSN 0018-9448. doi: 10.1109/TIT.2005. 844059. URL http://dx.doi.org/10.1109/TIT.2005.844059.

P. Ferragina and G. Manzini. Indexing Compressed Text. J. ACM, 52(4):552–581, 7 2005. ISSN 0004-5411. doi: 10.1145/1082036.1082039. URL http://doi.acm.org/10. 1145/1082036.1082039.

G. Hamerly. Making k-means even faster. In SIAM International Conference on Data Mining (SDM), pages 130–140, 2010. URL http://72.32.205.185/proceedings/datamining/ 2010/dm10_012_hamerlyg.pdf.

J. Jang, D. Brumley, and S. Venkataraman. BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis. In Proceedings of the 18th ACM conference on Computer and communications security - CCS, pages 309– 320, New York, New York, USA, 2011. ACM Press. ISBN 9781450309486. doi: 10.1145/2046707.2046742. URL http://dl.acm.org/citation.cfm?doid=2046707.2046742.

M. E. Karim, A. Walenstein, A. Lakhotia, and L. Parida. Malware phylogeny generation using permutations of code. Journal in Computer Virology, 1(1):13–23, 2005. ISSN 1772-9904. doi: 10.1007/s11416-005-0002-9. URL http: //dx.doi.org/10.1007/s11416-005-0002-9.

E. Keogh, S. Lonardi, and C. A. Ratanamahatana. Towards Parameter-free Data Mining. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 206–215, New York, NY, USA, 2004. ACM. ISBN 1-58113-888-1. doi: 10.1145/1014052.1014077. URL http://doi.acm.org/ 10.1145/1014052.1014077.

K. Li and J. Malik. Fast k-Nearest Neighbour Search via Prioritized DCI. In Thirty-fourth International Conference on Machine Learning (ICML), 2017. URL http://arxiv.org/ abs/1703.00440.

M. Li, X. Chen, X. Li, B. Ma, and P. M. Vitanyi. The Similarity Metric. IEEE Transactions on Information Theory, 50(12):3250–3264, 2004. ISSN 0018-9448. doi: 10.1109/TIT.2004.838101.

S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An Extension of the Burrows Wheeler Transform and Applications to Sequence Comparison and Data Compression. In Proceedings of the 16th Annual Conference on Combinatorial Pattern Matching, CPM’05, pages 178–189, Berlin, Heidelberg, 2005. Springer-Verlag. ISBN 3-540-26201-6, 978-3-540-26201-5. doi: 10.1007/11496656{\_}16. URL http://dx.doi.org/10.1007/11496656_16.

S. Mantaci, A. Restivo, G. Rosone, and M. Sciortino. An extension of the Burrows–Wheeler Transform. Theoretical Computer Science, 387(3):298–312, 11 2007. ISSN 0304-3975. doi: 10.1016/J.TCS.2007.07.014. URL https://www.sciencedirect.com/science/article/pii/ S0304397507005282.

S. Mantaci, A. Restivo, and M. Sciortino. Distance measures for biological sequences: Some recent approaches. International Journal of Approximate Reasoning, 47(1):109–124,

1 2008. ISSN 0888-613X. doi: 10.1016/J.IJAR.2007.03. 011. URL https://www.sciencedirect.com/science/article/ pii/S0888613X07000382.

G. Manzini. An Analysis of the Burrows-Wheeler Transform. Journal of the ACM, 48(3):407–430, 5 2001. ISSN 0004-5411. doi: 10.1145/382780.382782. URL http://doi.acm. org/10.1145/382780.382782.

L. Martignoni, M. Christodorescu, and S. Jha. OmniUnpack: Fast, Generic, and Safe Unpacking of Malware. In TwentyThird Annual Computer Security Applications Conference (ACSAC 2007), pages 431–441. IEEE, 12 2007. ISBN 0-7695-3060-5. doi: 10.1109/ACSAC.2007.15. URL http: //ieeexplore.ieee.org/document/4413009/.

D. Müllner. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378, 2011. URL http://arxiv.org/abs/1109.2378.

E. Raff and C. Nicholas. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17, pages 1007–1015, New York, New York, USA, 2017a. ACM Press. ISBN 9781450348874. doi: 10.1145/ 3097983.3098111. URL http://dl.acm.org/citation.cfm? doid=3097983.3098111.

E. Raff and C. Nicholas. Malware Classification and Class Imbalance via Stochastic Hashed LZJD. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ’17, pages 111–120, New York, NY, USA, 2017b. ACM. ISBN 978-1-4503-5202-4. doi: 10.1145/3128572.3140446. URL http://doi.acm.org/10. 1145/3128572.3140446.

E. Raff and C. Nicholas. Toward Metric Indexes for Incremental Insertion and Querying. arXiv, 2018a. URL http://arxiv.org/abs/1801.05055.

E. Raff and C. K. Nicholas. Lempel-Ziv Jaccard Distance, an effective alternative to ssdeep and sdhash. Digital Investigation, 2 2018b. ISSN 17422876. doi: 10.1016/j.diin.2017. 12.004. URL https://doi.org/10.1016/j.diin.2017.12.004.

E. Raff, J. Aurelio, and C. Nicholas. PyLZJD: An Easy to Use Tool for Machine Learning. In C. Calloway, D. Lippa, D. Niederhut, and D. Shupe, editors, Proceedings of the 18th Python in Science Conference, pages 97–102, 2019. doi: 10.25080/Majora-7ddc1dd1-00e. URL http:// conference.scipy.org/proceedings/scipy2019/pylzjd.html.

J.-M. Roberts. Virus Share, 2011. URL https://virusshare. com/.

R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi. Microsoft Malware Classification Challenge, 2018. URL https://www.kaggle.com/c/malware-classification/.

A. Rosenberg and J. Hirschberg. V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning(EMNLP-CoNLL), pages 410–420, 2007.

V. Roussev. Data Fingerprinting with Similarity Digests. In K.-P. Chow and S. Shenoi, editors, Advances in Digital Forensics VI: Sixth IFIP WG 11.9 International Conference on Digital Forensics, Hong Kong, China, January 4-6, 2010, Revised Selected Papers, pages 207–226. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010. ISBN 978-3-642-15506-2. doi: 10.1007/978-3-642-15506-2{\_}15. URL http://dx.doi.org/10.1007/978-3-642-15506-2_15.

M. Sebastián, R. Rivera, P. Kotzias, and J. Caballero. AVclass: A Tool for Massive Malware Labeling. In F. Monrose, M. Dacier, G. Blanc, and J. Garcia-Alfaro, editors, Research in Attacks, Intrusions, and Defenses: 19th International Symposium, RAID 2016, pages 230–253. Springer International Publishing, Paris, France, 2016. ISBN 978-3-319-45719-2. doi: 10.1007/978-3-319-45719-2{\_}11. URL http://dx.doi.org/10.1007/978-3-319-45719-2_11.

N. VanHoudnos, W. Casey, D. French, B. Lindauer, E. Kanal, E. Wright, B. Woods, S. Moon, P. Jansen, and J. Carbonell. This Malware Looks Familiar: Laymen Identify Malware Run-time Similarity with Chernoff faces and Stick Figures. In 10th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2017. doi: 10.4108/eai.22-3-2017.152417.

P. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the fourth annual ACM-SIAM Symposium on Discrete algorithms, page 311–321. Society for Industrial and Applied Mathematics, 1993. URL http://dl.acm.org/citation.cfm? id=313789.

Appendix A Compression Distances With High Entropy

As a final analysis, we will look at the behavior of EBWT, LZJD, and our new BWMD in cases where input strings have high entropy. Such situations are common in malware classi-fication, due to the use of “packing”, where a malware author will use compression and/or encryption to either reduce a file’s size or hide its contents (Martignoni et al., 2007).

A.1 Similarity of Two Random Files

First we will consider the behavior when measuring the distance between two files, both of which are completely uniform random strings of characters, and so have a maximal entropy is the prob- ability of observing character s from the alphabet . For an empirical perspective, in Figure 4 we show a contour plot of the normalized distance (i.e., scaled to be in the range [0, 1]) over an alphabet . On the x and y axis are the lengths of the random files being compared, and the value plotted is the average normalized distance over 30 trials of different random sequences for each trial. In the plot, dark blue indicates a near zero distance, and dark red a distance near one.

For EBWT, the intuition is that random content will result in a random ordering in the EBWT merge step, and thus have a few source repetitions, and ultimately produce a small normalized distance (i.e., see that EBWT follows this behavior, but has an unusual and undesirable pattern of non-zero values for smaller sequences (). Along the diagonal of equal-length files, a value of occurs, but then increases on the off diagonal, before decreasing again once the size

For BWMD, this analysis is more grounded. We know that for a random sequence, the transformed BWT has nothing it can compress, and will result in a different permutation of the still random string. This means the distribution of the prefix to every substring will not change (i.e., the preceding character is random, which is true by definition), and so we expect the transition vector x in step 3 of the BWMD to converge toward the uniform distribution as the size of the input string increases. So in the limit, we know two random files will receive a distance near zero. However, if one (or both) of the random files are sufficiently small, a large distance will occur due to a lack of convergence toward the uniform distributions. The rate of change will depend on the alphabet size , and can be described as the expected value of the differences between transformed multinomial distributions. While bounding this value mathematically is difficult, we can see in Figure 4 that this expectation holds true empirically.

Because LZJD measures the similarity of compression dictionaries from the Lempel Ziv algorithm, it was assumed that distances would become smaller as the size of a random sequence increased, due to the accumulation of all smallest substrings (Raff and Nicholas, 2017a). We see in practice that this does not quite occur. For sequences of significantly different length, LZJD returns a distance near 1, in part due to the size difference forcing the intersection to be a small portion of the union of two files. When two random sequences are of a more similar length, we see that the LZJD distance

Figure 4: Distance between two random files of different lengths.

does begin to drop to nearly , but still remains relatively high. This makes sense due to the random component of the input strings. As the set of all shortest substrings is constructed, consider the scenario where the compression dic- tionary has all substrings of length , but has only processed half of the total sequence. For the remaining half, we will obtain an essentially random sampling of strings of length , of which there are as many strings of increased length (one for every character in the alphabet , times every string of length ). If two strings are of the same length, the likelihood of seeing the same strings of length then becomes small, increasing the distance, while all strings of size increase the similarity and reduce distance.

Between BWMD and LZJD, we have two opposing styles of measuring the distance between two high-entropy sequences. As was noted in (Raff and Nicholas, 2017a), one could argue that two random sequences should have high distance because they share no underlying information (this is how NID would behave), or lower distance because both sequences share the property of “looking random”. While we argue that EBWT has a more confusing and inferior behavior, a preference to BWMD or LZJD’s behavior may be task-specific.

A.2 High Entropy vs Low Entropy

An important distinction in behaviors on random sequences is when a uniform random sequence (max entropy=1) is compared to a second sequence with a different level of entropy. For BWMD, we can derive intuition about its behavior in this case from work on analyzing the BWT compression algorithm. In particular, it has been shown that the BWT’s efficiency at compression of k’th order combinations over the alphabet is bounded by k’th order empirical entropy (Manzini, 2001). Thus we should anticipate that BWMD will produce differing Markov transition matrices, and thus differing distances, as the entropy of the second sequence changes. We confirm this empirically in Figure 55.

Figure 5: Average distance (y-axis) between a primary sequence of length and a second sequence of a specific entropy between 0 and 1 (x-axis), and also of length

We can see that LZJD and EBWT do not share this desirable property of differing distance with differing entropy. Given a single sequence of high entropy, LZJD will deem al- most all other entropy sequences equidistantly far, and EBWT all other entropy sequences equidistantly near. In the case of LZJD, this results in behavior effectively equivalent to when the dimension of a problem continues to increase, and all other points becoming increasing equidistant to each other (Beyer et al., 1999). We believe BWMD’s ability to delineate degrees of similarity/distance based on relative entropy levels is an important factor toward its ability to provide better clustering results compared to LZJD, which will be shown later in § 6.