Benchmarking Symbolic Execution Using Constraint Problems -- Initial Results

2020·Arxiv

Abstract

Abstract

Symbolic execution is a powerful technique for bug finding and program testing. It is successful in finding bugs in real-world code. The core reasoning techniques use constraint solving, path exploration, and search, which are also the same techniques used in solving combinatorial problems, e.g., finite-domain constraint satisfaction problems (CSPs). We propose CSP instances as more challenging benchmarks to evaluate the effectiveness of the core techniques in symbolic execution. We transform CSP benchmarks into C programs suitable for testing the reasoning capabilities of symbolic execution tools. From a single CSP P, we transform P depending on transformation choice into different C programs. Preliminary testing with the KLEE, Tracer-X, and LLBMC tools show substantial runtime differences from transformation and solver choice. Our C benchmarks are effective in showing the limitations of existing symbolic execution tools. The motivation for this work is we believe that benchmarks of this form can spur the development and engineering of improved core reasoning in symbolic execution engines.

I. INTRODUCTION

Symbolic execution [1] is one of the most powerful techniques for bug finding and program testing. It is successfully used on real-world code [2]–[4]. It is also used in reverse engineering [5] and program repair [6]. Many real vulnerabilities have been found using either coverage-based fuzzing, e.g., AFL [7], or symbolic execution. In the Darpa Cyber Grand Challenge, combining both fuzzing and symbolic execution was found to be effective [8], [9].

Our goal is to help improve symbolic execution through benchmarking, in particular, benchmarks which exercise the core reasoning techniques used which are essentially constraint solving, path exploration, and search. While path explosion [8], [10] is a challenge for symbolic execution, we show that systematic variation of problem difficulty with benchmarks combining constraint solving and path exploration is a way of measuring the performance of a symbolic execution engine. Constraint solving is also reported to be the dominant cost in symbolic execution of non-trivial programs [11]–[13]. Programs which naturally exhibit such challenges are when there is difficulty to invert computation such as a cryptographic computation or deliberately obfuscated code (including malware) which has been written intentionally to be difficult to analyze [14]. One intrinsic reason for this difficulty is that such code is combinatorially difficult for symbolic execution (both path exploration and/or constraint reasoning).

In this paper, we propose that the intrinsic path conditions challenges which underlie symbolic execution can be exercised and evaluated directly by using benchmarks of non-trivial combinatorial problems. We are also motivated by how “hard benchmarks” from the SAT competition have been used to drive the development of SAT solvers.

There are also other practical issues which can challenge symbolic execution such as system calls, memory modeling, heap reasoning [10], [15]—these are orthogonal to this paper. Our focus is on evaluating the core reasoning power of tools based on symbolic execution independently from the other challenges above. Unlike typical benchmarking of symbolic execution tools which use existing programs, we design special synthetic benchmarks to deliberately “stress test” path conditions (constraint solving), path explosion, and exploration.

Symbolic execution relies on techniques for constraint solving, typically satisfiability modulo theory (SMT) solvers, and handling of path conditions and path exploration. Since the core techniques rely essentially on constraint solving and search, we propose the use of discrete combinatorial problems for benchmarking. One well-known framework for discrete combinatorial problems is Constraint Satisfaction Problems (CSP). We transform CSP instances to C programs to create our benchmarks. The transformation makes reaching a particular program point the same as determining the satisfiability of the underlying CSP (and also finding a solution). We give several transformations (into C code) for the same CSP resulting in different C programs. Ideally, as the CSP is the same, different tools should have similar runtimes. The results show this is not the case, justifying how our benchmarks can show weaknesses in the core reasoning of the analysis tools. The benchmarks also test the sensitivity of symbolic execution to different programs for the same problem and scalability (where reasoning becomes more difficult).

Our preliminary experiments are on the following tools, which analyze C programs using LLVM [16] as the common representation of the program: KLEE [2] varying the constraint solvers (STP [17] and Z3 [18]); Tracer-X [19] which adds interpolation to KLEE; and the LLBMC [20] bounded model checker (LLBMC can be considered as “static” symbolic execution [21]). The experiments show that KLEE-based tools are much more sensitive than LLBMC to the C constructs and structure of the code. We also show that the transformation (encoding) can affect different solvers in rather different ways. As our benchmarks involve intrinsically harder path exploration, preliminary experiments suggest that the execution times of the KLEE-based systems depend more heavily on the solver used than on the sophistication of the search the technique used, for instance, it seems that interpolants and abstraction learning does not yet have sufficient gains. There are also substantial differences in running times by merely replacing the solver used for the same tool.

The paper is organized as follows. Related work is presented in Section II and background on CSP in Section III. The transformation of CSP benchmarks into C programs is given in Section IV. Experimental results on the C benchmark suite are given in Section V and Section VI concludes.

II. RELATED WORK

Symbolic execution generalizes program instructions into mathematical relations (constraints) between the state before and after each instruction [1]. In this way, instead of actually executing each instruction, a symbolic execution system collects the constraints that are translated from each instruction into a conjunction, called the path condition, which can be tested for satisfiability using constraint solvers, e.g. satisfia-bility modulo theories (SMT) solvers [22], such as STP [17] or Z3 [18]. Symbolic execution is widely used for whitebox fuzzing, the advantage is that each program path is only tested once, whereas traditional black-box fuzzing may exercise the same path multiple times.

KLEE [2], [23], [24] is one of the most well-known symbolic execution tools and shown to be effective. A critical challenge in symbolic execution is to make it more scalable. One approach is Tracer-X [19] which extends KLEE with learning (interpolation) to prevent redundant execution [25] to mitigate path explosion. A recent work, chopped symbolic execution [11], also reduces path explosion by extending KLEE so that parts of the code can be manually excluded by executing them lazily during symbolic execution to make overall execution faster. In our benchmarks, the programs are transformations of non-trivial CSP benchmarks such that there is no obvious part of the code, i.e., the transformed C program, that can be removed from consideration. Thus, chopped symbolic execution is not applicable. In this paper, our benchmarks are C code, hence, we focus on KLEE-based systems.

Other approaches related to symbolic execution are DART [3], CUTE [4], SAGE [26], and S2E [27]. These can be categorized as concolic testers, where the construction of the path condition follows an actual execution of the program. Surveys of symbolic execution are in [10], [28].

There has been work to measure the capabilities of symbolic execution tools. Several challenges are identified with concolic testing, in particular, constraint solving [29]. Benchmarking of symbolic execution tools with logic bombs, code fragments which can cause difficulty for symbolic execution such as buffer overflow, arithmetic overflow, etc. is proposed in [30]. We differ in that we address the challenge which occurs in

Fig. 1. Example CSP Constraints in XCSP3 Format

the simplest code, this is different from challenges which can arise from dealing with undefined behavior in C such as buffer overflow. Another challenging problem for symbolic execution is how to handle cryptographic functions [31], [32], the combinatorial challenges are related to the CSP benchmarks we propose. Evaluation of symbolic execution tools applied to industrial software is in [33] and considers limitations in handling language features. Language features are also tested in [34] using Java and .NET test generators.

III. CSP BACKGROUND

One formulation of constraint problems are Constraint Sat- isfaction Problems (CSP). A Constraint Satisfaction Problem (CSP) P is a pair (X,C) where X is a set of n variables {x1,...,xn} and C a set of e constraints {c1,...,ce} [35], [36]. In this paper, we focus on finite discrete combinatorial problems. In a finite domain CSP, variables x X take values from their domain D(x) which is a finite set of values. Each c C has two components: a scope (scp(c)) which is an ordered subset of variables of X; and a relation over the scope (rel(c)). Given scp(c) = xi1,...,xir,relxij) is the set of satisfying tuples (combinations of values) for the variables in scp(c). A constraint is satisfied if there is valuation for the variables from at least one of the tuples in its relation which takes values from the variables’ domain. A solution to P is a valuation for X such that every constraint is satisfied. P is satisfiable iff a solution exists. If no solution exists then P is unsatisfiable.

A constraint can be defined in two ways: (i) intensional; or (ii) extensional (also known as table constraint). An intensional constraint is one where the relation of the constraint is defined implicitly. This is a common form of constraints supported in constraint solvers, e.g., a linear arithmetic inequality is implicitly understood in the usual arithmetic fashion over either real numbers or integers. Constraints can also be defined extensionally—the relation is defined as a (positive) table giving the satisfying tuples of constraint c (respectively, also a “negative table” defining the set of “negative tuples”, namely, the tuples not satisfying the constraint). Consider the constraint x+y = 1 where the domain of the variables is binary. The equation above is already in the intensional form of the

TABLE I TRANSFORMATION VERSIONS AND FEATURES USED. DETAILS OF TRANSFORMATIONS AND FEATURES ARE ON THE LEFT OF THE (STATISTICS ARE ON THE RIGHT OF (AVERAGE NUMBER LINES OF THE C CODE AND NUMBER OF VARIABLES OF THE CSP INSTANCES.

constraint. The positive extensional form is given by the tuples (respectively, the negative extensional form is ) for the variable sequence .

As our benchmarks used the XCSP3 format, we briefly describe XCSP3. Figure 1 illustrates three kinds of constraints in XCSP3 format [37], [38]. XCSP3 is an XML format for defining CSP instances which can be used with many constraint solvers. It is also used in the XCSP3 competition for constraint solvers [39]. We use XCSP3 problem instances in our experiments. The details of the XCSP3 format are beyond of the scope of the paper, so examples in XCSP3 are meant to be illustrative only.

We consider CSPs where the constraints are in: (i) intensional form, and (ii) extensional form (see Figure 1 (a) which defines a negative table with two constraints on x0,x1,x2and x3,x4,x5using the same table definition). Some examples of intensional constraints are: eq(%0,dist(%1,%2)) representing x for a given substitution of x, y, and z (see Figure 1 (b)); and the alldifferent constraint (see Figure 1 (c)), which constrains the variables x0,x1,x2to all take different values. We remark that alldifferent is considered a global constraint but in this paper, what matters is the intensional versus extensional distinction of the constraint.

IV. TRANSFORMING COMBINATORIAL PROBLEMS TO C

The idea behind turning a CSP into a C program is as follows. Solving a combinatorial problem can be imagined as two components: firstly an oracle guesser for a solution; and secondly a checker for the solution. We can think of symbolic execution as finding a solution which the oracle could have returned and we transform the CSP into a C program that functions as the checker.

Our approach is to transform a combinatorial problem, a CSP P, into a C program Pto be tested on program analysis tools as follows: (i) the finite domain variables of the CSP correspond to integer variables in the program P(C variables are treated as symbolic); (ii) CSP variable domains are converted to assume statements (see below); and (iii) the constraint relations are encoded into conditional statements in the program or as assume statements. The encoding of the CSP to C ensures that when the CSP P is satisfiable (symbolic) execution of Pis able to reach a distinguished program point—the values of the C variables in Pare a solution to CSP P. To test the analysis tools, the distinguished program point is mapped to an assertion failure, i.e. assert(0). Similarly, when CSP P is unsatisfiable, the assert cannot be reached from any path, i.e., all paths will not succeed.

From a single CSP P, we propose several transformations of P into C programs.The purpose of various transformations which result in different programs is to exercise the tools in different ways (as we show in the results). We employ two general approaches for encoding the constraints in P:

1) if approach: if statements are created, whose condition captures the values of variables satisfying the constraint. The execution can terminate with an exit(0) statement in their else branches. If it is possible that execution takes the then branches, then there will be satisfying values for the variables of the encoded constraint. In symbolic execution, the condition is simply added into the current set of constraints, i.e., the path condition.

2) assume approach: the constraint solver of the analysis tool is used directly—the constraints of the CSP P are translated into an argument to assume state-

(d) Intensional Version 1 (dist) if ((y0==dist(x0,x1) && y1==dist(x1,x2))); else exit(0);

Fig. 2. Example transformed constraints with version numbers from Table I. Version 1 is the full transformation, the others are only the constraint.

ments (klee_assume of KLEE or __llbmc_assume of LLBMC). When an assume statement is executed in symbolic execution, the constraint is simply added into the path condition and not intended to be executed as C, whereas the if is C code.

When all constraints have been translated, we place an assert(0) at the end of the program (the distinguished program point). The assert(0) triggers an assertion failure, so if the CSP P is satisfiable (no contradiction is found), the program terminates with an assertion failure. The generated test case from symbolic execution, e.g., KLEE or Tracer-X, is a solution for P. If the CSP P is unsatisfiable, assertion failure will not be possible and will be reported as a failure.

Table I (to the left of the double vertical line) shows how different features are combined to obtain different transformed (versions) C programs for a CSP P. Overall we have designed 12 extensional transformations for extensional CSPs and 10 intensional transformations for intensional CSPs.1 Each transformation version (different row in Table I) employs different C constructs on the same underlying CSP P. The transformations are designed to be correct by construction. There is insufficient space to formalize the transformations, rather, we do it by example. Figure 2 (a) shows how the negative tables in Figure 1 (a) are transformed into C under the Version 1 transformation. The green coloured code in Figure 2(a) shows details of the transformation common to all versions of a particular CSP: C variable declaration, KLEE symbolic variable construction, constraining variable domain, and distinguished assertion program point. CSP variables are symbolic in KLEE and domains are encoded with klee_assume (similarly for LLBMC). Reaching the assert means the CSP is satisfiable. Similarly, Figures 2 (b), and (c) are the results of Extensional Versions 5 and 8 on the same constraint. Figure 2 (d) and (e) are from Intensional Versions 1 and 2 on Figure 1 (b), and Figure 2 (f) is from Intensional Version 9 on Figure 1 (c). Figure 2 (g) is from Intensional Version 3 combining constraints in Figures 1 (b) and (c) as a single problem.

Table I shows how the transformations in each column (Construct, Operator, Grouped) are combined to give a particular version. The Construct column gives the construct used in the C program, if/assume approaches. Figures 2 (a) and (b) are the transformation results of the if approach on the constraint example in Figure 1 (a), while Figure 2 (c) shows the result of the assume approach, which uses an assume statement (example is for KLEE using klee_assume, whereas when testing on LLBMC, __llbmc_assume is used).

The Operator column gives the logic operators used in the conditions, C logical operators (&& or ||) or C bitwise operators, which are used to combine the expressions making up the conditions. The difference between the logical and bitwise operators is that the logical operators are from C-style short-circuiting of conditions, i.e. the condition is broken up into a cascading conditional branching structure with only a atomic condition guarding every branch. Atomic conditions do not contain logical or bitwise operators. Figure 2 (b) and (f) shows the usage of bitwise operators in the transformation, while Figure 2 (a), (c), (e) and (g) show the usage of logical operators. For Intensional Transformations, when there is no grouping of conditions, no operator is needed (denoted by NOP in Version 1 and 6) in Table I and illustrated in Figure 2 (d). The CSP may have several constraints defined in the same way but on different variables, i.e. in Figure 1 (a), there is a single constraint group with two individual constraints.

The Grouped column shows how translation is grouped by constraints: the translation is per constraint group (yes); per individual constraints defined in the group (no); or the entire CSP P is grouped together as a single condition (all). For example, Figure 2 (a) and (d) show the no case, Figure 2 (g) shows the all case given the constraints in Figures 1 (b) and (c), and the rest of Figure 2 show the yes case. Figure 2 (g) simply combines the assert(0) statement in the then body of the if statement. We remark that as the benchmarks are for testing path explosion and constraint solving, the transformations produce straight-line code. However, it is straightforward to make more compact forms with loops as well, but that only makes symbolic execution more complex.

V. EXPERIMENTS

A. Experimental Setup

We present experiments evaluating selected symbolic execution tools on the C programs created from our transformations of extensional and intensional CSP problems from the XCSP3 benchmarks [38]. We tested three classes of extensional constraint problems: AIM-100 and AIM-200 (each consists of 16 satisfiable and 8 unsatisfiable problems), and Dubois (30 unsatisfiable problems). For intensional CSP problems, we used the All-Interval (19 satisfiable problems), CostasArray (8 satisfiable problems), and Haystacks (14 problems) problem classes. The intensional constraints used in these CSP instances are shown in Table II. Statistics on the C benchmarks are given on the right of (||) in Figure 2 giving the average number of lines of code in the C programs and the average number of variables of the CSP instance for the problem classes tested. These problem instances from the XCSP3 benchmarks were chosen primarily because the transformation results are comparatively smaller than other instances (e.g., most other extensional XCSP3 benchmarks gave much larger C programs which could not finish running in a day). All C programs in our benchmark suite are compiled into common bytecode (LLVM bitcode) instances using LLVM’s clang at -O3. Our benchmarks are publicly available [40].

Experiments were run on a 3.40 GHz Intel i7-6700 32GB RAM machine running Ubuntu 16.04. We focus on C code and only evaluate tools which are available for C. To keep things on a common footing, we experimented with the following LLVM-based tools analyzing the common compiled C instances: KLEE [2] with STP and Z3 as the backend SMT solvers, Tracer-X [19] which uses only Z3 as the backend solver, and LLBMC, which uses STP as the backend solver. KLEE is used as it is perhaps the most well-known symbolic execution engine and is commonly used, at least 130 papers use KLEE [24]. It was 2nd placed, highest systematic tool, in the 2019 Test-Comp competition. Tracer-X builds upon Tracer [41], extending KLEE with abstraction learning [42] and interpolants [43] which can prune paths to mitigate the path explosion problem. LLBMC [20] is not a symbolic execution engine, it is a bounded model checker, but it also addresses the same core issues (constraint encoding, constraint solving, solver search space) and uses the same underlying solvers (STP). The results show LLBMC to be a relevant and useful comparison. All tools used a timeout of 1000 seconds.2

TABLE II INTENSIONAL CONSTRAINTS USED

We also ran the AbsCon CSP solver [44], [45] with default options on the original CSP instances as a baseline comparison for the XCSP3 instance. AbsCon is a Java-based constraint solver using the XCSP3 format. As such, no further input encoding or transformations are needed for XCSP3 instances, making it a convenient comparison. We use the AbsCon baseline to make it easy to present the timings—normalizing the execution time of the tools to the AbsCon timing.3

We evaluate the benchmarks on the analysis tools on: (i) transformation robustness; and (ii) tool scalability. Transformation robustness measures how varying the transformations used on a single problem instance P changes the execution time. Ideally, if the tool is robust with respect to the transformations, the timing variations should be small, i.e., it is insensitive. We highlight that the different transformed C programs are on the same instance P, hence, the programs can be “considered equivalent”. One might expect (or even demand) a certain level of robustness for equivalent code. However, the results show that the tools are not robust with respect to transformation.

Tool scalability is the change in tool analysis time as the size of the problem changes (the chosen scalability benchmarks have a size parameter). In the scalability experiments, varying the size naturally changes the problem instance but the problem and code structure remains similar. We remark that although the AIM problems vary the number of variables, we do not use them for scalability as they do not have a structural scaling parameter defined by the problem, i.e. the number of variables may not be a fair parameter.

B. Evaluation

Figure 3 investigates transformation robustness for the extensional problems. We use average timings for analyzing the C code normalized to AbsCon solving time of the XCSP instance. Different transformations result in a different number of paths explored for the KLEE-based tools. We see that the KLEE-based tools are sensitive to the transformations and exhibit considerable variation, up to 2 orders of magnitude slower for certain transformations. They also exhibit a similar pattern in the transformation running times between AIM-100 and AIM-200 instances. Due to lack of space, results for Dubois are not shown, but it has a similar pattern to AIM instances. The timings show that there can be a complex relationship between the solver and performance of transformation. Sometimes KLEE/Z3 and Tracer-X/Z3 are slower than KLEE/STP but in others, it is faster (AIM-200 Version 4), showing the importance of the solver effect. (We remark that KLEE and STP are co-designed while Z3 is independent).

Fig. 3. Extensional Problems Transformation Robustness

Fig. 4. Extensional Problem Scalability - Dubois

Greater differences can be seen in the intensional benchmarks. We can see that generally, the greater the grouping (Table I Grouped column), the better the performance, in order of no to yes to all. There is a large difference between logical versus bitwise operators, e.g., Version 1 versus Version 4.

We see that LLBMC is significantly more robust under all transformations. This may be because LLBMC creates a single (SMT) constraint problem. The LLBMC runtime relative to Abscon in Figure 3 for AIM-100 is between 0.15-0.22 and for AIM-200 between 0.17-0.26, i.e., LLBMC is faster.

The results for the transformation robustness of intensional problems is shown in Figure 5. While LLBMC is still often better than KLEE, there is a larger variance. For example, All-Interval Version 9, KLEE(STP) is better than LLBMC but Tracer-X, which is KLEE-based, is similar to LLBMC. In the extensional code, the effort in Tracer-X to compute interpolants turned out to be wasted, but the value in the reduction of paths can be seen here where Tracer-X can outperform KLEE(STP/Z3). Here, STP is generally faster than Z3 compared to the extensional problems which only use equals or not-equals constraints. Haystacks shows an interesting solver issue, the only transformation difference between Version 3 and 8 is the if versus assume—one might think that the assume (more direct solver usage) should handle the constraints better, but it turns out to be worse. We conjecture this may be due to differences in the handling of the short-circuiting from the logical operators. Tracer-X behaves better for these instances.

Figures 3 and 5 use running time normalized to AbsCon running time. This shows that the KLEE-based symbolic execution tools are significantly slower than running the original CSP problem with the AbsCon solver. Note that this is not intended to be a fair solver comparison, all the systems use some solvers, but the solvers are used in different ways and the solvers themselves employ different techniques. LLBMC is often faster than AbsCon for the AIM problems. One possible reason is that the AIM instances are SAT problems and the AIM instances are known to be easy for modern SAT solvers. LLBMC uses STP, which in turn, uses the MiniSat SAT solver [46] can be expected to give good results. The AbsCon solver uses generalized arc consistency for the extensional constraints and does not employ learning. The performance of SAT solvers, on the other hand, is highly dependent on the use of clause learning. The intensional benchmarks show that STP solver performance is different from the extensional benchmarks, now STP is generally faster than Z3 but AllInterval Version 2 shows this is not always the case.

We investigate tool scalability with extensional (Dubois) and intensional problems where there is a size parameter. The results are shown in Figures 4 and 6. In these graphs, a greater problem number (x-axis) indicates a larger size parameter and the y-axis gives the number of timeouts (more timeout is

Fig. 5. Intensional Problems Transformation Robustness

worse). As can be observed, LLBMC is more scalable than the others, with KLEE/STP slightly more scalable than the two KLEE-based tools with the Z3 solver. We can see for the extensional problems that although Tracer-X was generally slower in the AIM benchmarks, it can solve more problems compared with KLEE(STP/Z3).

In summary, we see that LLBMC generally executes the benchmarks faster than other tools. It is known in the literature that the invocation of the constraint solver is expensive (e.g., [47]). During path enumeration, KLEE-based tools invoke the constraint solver multiple times to determine the feasibility of control-flow branches. Compared to the KLEE-based tools, LLBMC invokes the solver only once with a single global (much larger) problem. The versions run on KLEE-based systems which correspond most closely to LLBMC solver usage are: Extensional Versions 6 and 12 and Intensional Versions 5 and 10. As can be seen from Figures 3 and 5, these versions show much lower execution times for the KLEEbased tools compared with the other transformations.

Our experiments may give a different impression from conventional use of KLEE on real-world C programs, e.g., KLEE may be expected to work on small programs but may

Fig. 6. Intensional Problems Scalability

not scale to large ones; however, the Dubois C programs are small (hundreds of lines) and still timeout. This may be because solving and path explosion is more pronounced in our “benchmark code” versus “typical code”. However, the choice of artificial code is deliberate—to allow us to study the tradeoffs and interactions between the solver and path exploration. Real-world programs may pose less of a challenge to this aspect, but can still pose challenges in other aspects, such as system calls. The results show that the execution times of the KLEE-based tools on the benchmarks depend heavily on the solver used (STP or Z3) and are significantly affected by the C constructs (from the transformations). Although both LLBMC and KLEE/STP use the same solver, the timings differ substantially, which suggests that KLEE-based symbolic execution has considerable room for optimizations. Our benchmarks are designed so that the reasoning on the program execution paths are nontrivial. Tracer-X uses interpolants which can prune (irrelevant) execution paths, however, in our benchmarks, interpolants can have some benefit but the cost of interpolants is also a drawback. This suggests that the use of interpolants may not help so much in the harder path exploration cases arising in our benchmarks.

C. Discussion

We summarize how the experimental results can help address the performance of symbolic execution with the following research questions:

• Q1:What kinds of programs are easier/harder to analyze with symbolic execution? The experiments show that KLEE/STP, KLEE/Z3, and Tracer-X/Z3 are not robust under different transformations, thus, programs coded in a particular way will be harder to analyze.

• Q2:Do different constraints change the difficulty of the symbolic execution? The benchmarks use various constraints. The extensional benchmarks use simple equality constraints of the form x = c or x c, while the intensional benchmarks have more complex constraints, e.g. 0 . There is a larger variation in the overall robustness with the intensional benchmarks.

• Q3:Are there differences in overall solver performance between different solvers? The experiments show solver choice is significant, e.g., between STP and Z3; however, this relationship is complex and may not be simple to predict. From anecdotal experience, KLEE/STP is expected to outperform KLEE/Z3; however, for certain kinds of code, KLEE/Z3 can be much faster.

Overall, the results suggest that considerable improvements to symbolic execution tools may be possible, such as better solver integration and improved solver reasoning. While Tracer-X was in many cases slower than KLEE, possibly due to the interpolation overhead, there were cases where it was faster, which suggests that smarter removal of execution paths is beneficial but much research remains to be done as shown by the overall results. As our transformations can be considered a kind of pre-processing step, a smart pre-processing step may be another direction for improving symbolic execution. This may be analogous to some constraint solvers which perform extensive pre-processing, for example, the pre-processing phase in the CPLEX MIP solver is sufficient to solve some problems without going to the actual solver [48]. As far as we are aware, such more advanced constraint processing ideas have not been employed in symbolic execution.

We highlight that the objective of this paper is to build benchmarks suitable to improve the core reasoning of symbolic execution. So while our benchmarks generally have bounded model checking (LLBMC) being faster, this is partly an artifact because we have simple systematic code which is loop-free. The comparison with LLBMC is also because many of the core reasoning components are similar to KLEE and Tracer-X, in particular, the same underlying solver. We found instances where KLEE/STP outperforms LLBMC. For general and more complex code, symbolic execution has advantages over bounded model checking. The main limitation of bounded model checking being the bound which also affects scalability if the bound is large. Symbolic execution also has advantages when it comes to system calls and mixing with concrete execution. However, for our benchmarks, we want to deliberately focus on the solver, path exploration and search, so these other factors are orthogonal to our work.

While we have deliberately chosen simple transformations, more complex ones are possible. Rather than loop-free code, extensional constraints can be transformed into a generic table driven code with loops. This may not work so well on LLBMC if the specified bound is exceeded but would make little difference to KLEE. We have chosen a simple code structure to focus on the solver, path exploration and constraint handling mechanisms intrinsic to symbolic execution.

VI. CONCLUSION

We proposed finite-domain constraint satisfaction problems (CSPs) for evaluating the effectiveness of the reasoning techniques in symbolic execution tools. We show how to encode CSP instances by transforming the problem into C programs using different C constructs. Although multiple C programs are created, they are essentially equivalent from the satisfiability and solution of the CSP perspective. Preliminary testing with the KLEE, Tracer-X, and LLBMC tools with STP and Z3 solvers show that the transformation used significantly affects the performance of the tools even though the underlying reasoning problem is the same. The degree of sensitivity of symbolic execution tools to “writing style” of the program has not been studied, and we feel is an important direction for improved and more robust symbolic execution technology.

The use of logical operators common in C has a significant effect compared to the use of bitwise operators. It suggests that a viable strategy may be to flip the semantics of the code between bitwise and logical operators in cases where there are no side effects. This could give a performance boost—we are not aware that this strategy has been tried before.

Ideally, as the programs are “equivalent to the CSP instance”, we may expect performance from the tools to be not dissimilar since the underlying reasoning techniques are also similar. However, our benchmarks also show complex and substantial differences. There is also a large performance gap between LLBMC and KLEE/STP even when both use the same underlying STP solver, showing another direction for system improvement. Although our benchmarks are synthetic, precisely because they focus on the basic core reasoning engine, we believe that benchmarks of these form can help drive basic improvements to symbolic execution tools. What we would like is that the analysis tools are more robust and more scalable. We believe that if symbolic execution-based tools can perform better on the kinds of benchmarks we proposed, this will carry forward to improved overall analysis performance. In particular, for challenging code such as cryptographic code, obfuscated code, and malware where program analysis tools may not work well.

The results using Tracer-X suggest that optimizations which prune paths can increase scalability but it also suggests that much more improvements are needed. Future work may be to investigate redundant constraints as yet another dimension. Finally, we hope this can spur more synergy between techniques in constraint programming, testing and program analysis.

ACKNOWLEDGEMENT

We acknowledge the support of R-252-000-592-112 and R-252-000-A39-112. We would like to thank Andrew Santosa who helped with a preliminary version of this work.

REFERENCES

[1] J. C. King, “Symbolic execution and program testing,” Commun. ACM, 1976.

[2] C. Cadar, D. Dunbar, and D. Engler, “Klee: Unassisted and automatic generation of high-coverage tests for complex systems programs,” in USENIX Conference on Operating Systems Design and Implementation, 2008.

[3] P. Godefroid, N. Klarlund, and K. Sen, “Dart: Directed automated ran- dom testing,” in ACM SIGPLAN Conference on Programming Language Design and Implementation, 2005.

[4] K. Sen, D. Marinov, and G. Agha, “Cute: A concolic unit testing engine for c,” in European Software Engineering Conference, 2005.

[5] V. Chipounov and G. Candea, “Reverse engineering of binary device drivers with revnic,” in European Conference on Computer Systems, 2010.

[6] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra, “Semfix: Program repair via semantic analysis,” in International Conference on Software Engineering, 2013.

[7] “American Fuzzy Lop,” http://lcamtuf.coredump.cx/afl.

[8] Y. Shoshitaishvili, R. Wang, C. Salls, N. Stephens, M. Polino, A. Dutcher, J. Grosen, S. Feng, C. Hauser, C. Kruegel, and G. Vigna, “Sok: (state of) the art of war: Offensive techniques in binary analysis,” in IEEE Symposium on Security and Privacy, 2016.

[9] N. Stephens, J. Grosen, C. Salls, A. Dutcher, R. Wang, J. Corbetta, Y. Shoshitaishvili, C. Kr¨ugel, and G. Vigna, “Driller: Augmenting fuzzing through selective symbolic execution,” in Annual Network and Distributed System Security Symposium, 2016.

[10] C. Cadar and K. Sen, “Symbolic execution for software testing: Three decades later,” Commun. ACM, vol. 56, no. 2, Feb. 2013. [Online]. Available: http://doi.acm.org/10.1145/2408776.2408795

[11] D. Trabish, A. Mattavelli, N. Rinezky, and C. Cadar, “Chopped symbolic execution,” in International Conference on Software Engineering, 2018.

[12] T. Liu, M. Ara´ujo, M. d’Amorim, and M. Taghdiri, “A comparative study of incremental constraint solving approaches in symbolic execution,” in Haifa Verification Conference, 2014.

[13] H. Palikareva and C. Cadar, “Multi-solver support in symbolic execu- tion,” in International Conference on Computer-Aided Verifcation, 2013.

[14] B. Yadegari and S. Debray, “Symbolic execution of obfuscated code,” in ACM Conference on Computer and Communications Security, 2015.

[15] G. Duck, J. Jaffar, and R. H. Yap, “Shape neutral analysis of graph-based data-structures,” Theory and Practice of Logic Programming, vol. 18, no. 3-4, 2018.

[16] “LLVM compiler infrastructure project,” https://llvm.org/.

[17] V. Ganesh and D. L. Dill, “A decision procedure for bit-vectors and arrays,” in International Conference on Computer Aided Verification, 2007.

[18] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008.

[19] “Tracer-X: next-generation symbolic execution.” [Online]. Available: http://www.comp.nus.edu.sg/

[20] S. Falke, F. Merz, and C. Sinz, “The bounded model checker LLBMC,” in IEEE/ACM International Conference on Automated Software Engineering, 2013.

[21] T. Avgerinos, A. Rebert, S. K. Cha, and D. Brumley, “Enhancing symbolic execution with veritesting,” in International Conference on Software Engineering, 2014.

[22] L. D. Moura and N. Bjørner, “Satisfability modulo theories: introduction and applications,” Commun. ACM, 2011.

[23] “KLEE system,” https://klee.github.io.

[24] “KLEE papers,” https://klee.github.io/publications.

[25] J. Jaffar, A. E. Santosa, and R. Voicu, “An interpolation method for CLP traversal,” in International Conference on Principles and Practice of Constraint Programming, 2009.

[26] P. Godefroid, M. Y. Levin, and D. Molnar, “SAGE: Whitebox fuzzing for security testing,” Queue, vol. 10, no. 1, 2012.

[27] V. Chipounov, V. Kuznetsov, and G. Candea, “S2E: A platform for in-vivo multi-path analysis of software systems,” in Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2011.

[28] R. Baldoni, E. Coppa, D. C. D’elia, C. Demetrescu, and I. Finocchi, “A survey of symbolic execution techniques,” ACM Comput. Surv., vol. 51, no. 3, 2018.

[29] R. Kannavara, C. J. Havlicek, B. Chen, M. R. Tuttle, K. Cong, S. Ray, and F. Xie, “Challenges and opportunities with concolic testing,” in National Aerospace and Electronics Conference, 2015.

[30] H. Xu, Z. Zhao, Y. Zhou, and M. R. Lyu, “Benchmarking the capability of symbolic execution tools with logic bombs,” IEEE Transactions on Dependable and Secure Computing, 2018.

[31] D. Xu, J. Ming, and D. Wu, “Cryptographic function detection in obfuscated binaries via bit-precise symbolic loop mapping,” in IEEE Symposium on Security and Privacy, May 2017.

[32] R. Corin and A. Manzano, “Efficient symbolic execution for analysing cryptographic protocol implementations,” in International Conference on Engineering Secure Software and Systems, 2011.

[33] X. Qu and B. Robinson, “A case study of concolic testing tools and their limitations,” in 2011 International Symposium on Empirical Software Engineering and Measurement, 2011.

[34] L. Cseppento and Z. Micskei, “Evaluating symbolic execution-based test tools,” in IEEE International Conference on Software Testing, Verification and Validation, 2015.

[35] R. Dechter, Constraint Processing. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.

[36] F. Rossi, P. v. Beek, and T. Walsh, Handbook of Constraint Programming (Foundations of Artificial Intelligence). Elsevier Science Inc., 2006.

[37] F. Boussemart, C. Lecoutre, and C. Piette, “XCSP3: an integrated format for benchmarking combinatorial constrained problems,” CoRR, vol. abs/1611.03398, 2016.

[38] “XCSP3,” http://www.xcsp.org/.

[39] “XCSP3 competition,” http://xcsp.org/competition.

[40] “CSP-C,” https://github.com/vsahil/XCSP3 to C.

[41] J. Jaffar, V. Murali, J. A. Navas, and A. E. Santosa, “Tracer: A symbolic execution tool for verification,” in International Conference on Computer Aided Verification, 2012.

[42] J. Jaffar, A. E. Santosa, and R. Voicu, “An interpolation method for clp traversal,” in International Conference on Principles and Practice of Constraint Programming, 1999.

[43] W. Craig, “Three uses of the Herbrand-Gentzen theorem in relating model theory and proof theory,” The Journal of Symbolic Logic, 1957.

[44] S. Merchez, C. Lecoutre, and F. Boussemart, “Abscon: A prototype to solve csps with abstraction,” in International Conference on Principles and Practice of Constraint Programming, 2001.

[45] “AbsCon,” http://www.xcsp.org/tools.

[46] “MiniSat,” http://minisat.se.

[47] Y. Zhang, Z. Chen, and J. Wang, “Speculative symbolic execution,” in IEEE International Symposium on Software Reliability Engineering, 2012.

[48] T. Achterberg and R. Wunderling, “Mixed integer programming: Ana- lyzing 12 years of progress,” in Facets of Combinatorial Optimization, 2013.