Since the success of large-scale neural networks in the early 2010s, the field has grown explosively, in industry as well as in academia. Deep neural networks (DNNs) have become the de facto solution to many complex computer vision, speech recognition, and natural language processing problems. Popular as deep learning may be, building DNNs to solve real-world problems remains an arduous task. It requires a vast amount of high-quality labeled data and heavy use of computational resources and human expertise. DNNs are therefore invaluable technological assets with potentially huge commercial impact. Over the past few years, myriad companies have joined the AI arms race. Among AI startups alone, venture capital investment reached a record high of $9.3 billion in 2018 [1]. Their offerings range from commercial libraries for embedded systems and cloud machine learning APIs to private corporate clouds for AI, spanning industries such as transportation, manufacturing, healthcare, finance, and consumer electronics.
While DNNs are fueling commercial successes in the AI market, a void in IP protection for DNN models may hinder the progress. When a model owner sells a service to a customer, she should have a reliable way to prevent the customer from illegally distributing or reselling it. To achieve that goal, the owner needs not only to identify her own model when it is distributed, but also to prove her ownership to a trusted arbitrator. Recently, several researchers proposed watermarking as a viable solution to the IP protection problem in deep learning [2][3][4][5][6][7]. Digital watermarking originally refers to the process of covertly embedding information in multimedia content. The concept has since been extended to cover software [8], circuits [9], and DNNs. White-box watermarking embeds the owner’s information in the weights of a DNN. Black-box watermarking, on the other hand, embeds the watermark in the input-output behavior of the model. The set of inputs used to trigger that behavior is called the trigger set. For the popular task of image classification, a common approach is to assign a random label to trigger images and train the model to classify them accordingly. The non-triviality of ownership of a watermarked model rests on the extremely small probability that any other model exhibits the same behavior [4]. By detecting the watermark in a DNN model, the owner is able to both identify the model and prove her ownership.
Figure 1: Workflow of the trigger pattern-based black-box DNN watermarking.
Based on the characteristics of their trigger sets, existing black-box watermarking methods can be split into two categories. The first category curates a finite set of special trigger images. The special images can be completely random [4], samples derived from unused hidden space [5], or adversarial examples [3]. The second category maintains trigger patterns and adds them to natural input images to create trigger sets. The trigger patterns are usually meaningful patterns that can serve as proof of the owner’s identity, such as logos [6] and color-coded keys [7]. Figure 1 describes the workflow of the second category of methods. Some sample trigger images are shown in Figure 2.
The motivations for designing the trigger sets differ between the two categories of methods. The first category focuses on the functionality of the model: it aims to create trigger sets such that watermarking is as orthogonal to the normal functionality of the model as possible. The second, on the other hand, focuses more on the watermark extraction procedure. Associating the owner’s identity with the trigger set makes detection and proof of ownership much more straightforward. The evident drawback of the first category is the difficulty of establishing a connection between the trigger set and the owner. To solve that problem, researchers have gone as far as to use complex cryptographic tools [4]. Further, the limited size of the trigger set weakens the proof of ownership. The drawback of the second category lies in an inevitable tradeoff between the robustness of the watermark and the potential for false positive watermark detection. If a trigger pattern is too prominent, it risks triggering false positives in other neural networks. On the other hand, if a pattern is too inconspicuous, it may be easily removed during fine-tune attacks.
In this paper, we aim to bridge the gap between the different trigger set generation methods. We propose a differential evolution-based framework to determine how any given trigger pattern should be added to the image such that false positive detections are reduced while the robustness of the watermark is maintained. With our framework, trigger pattern-based watermarking takes the model functionality into account while still keeping ownership proofs simple. The contributions of our paper are as follows:
• We propose an evolutionary algorithm-based framework to optimize trigger patterns in order to facilitate robust and low-false-positive black-box watermarking.
• We survey and compare existing trigger set generation methods and present our analysis.
• We implement our method with popular DNN models and datasets and evaluate its performance.
The rest of the paper is organized as follows: Section 2 describes the watermarking problem in more detail and defines the problem. Section 3 presents our algorithm. Section 4 evaluates the performance of the proposed algorithm.
In this section, we first introduce the background of watermarking and then define our problem. As in most security-related literature, we use the Alice/Bob narrative to describe the scenario. Alice is the model owner. Bob is the customer who buys the model from Alice, and also the malicious attacker who tries to infringe on Alice’s IP rights.
Figure 2: Examples of trigger inputs. (a) A random out-of-distribution image [4], (b) a regular image with a logo [6], (c) a regular image with a color-coded key not perceptible by the eye [7].
2.1 DNN Watermarking
Table 1: Criteria for evaluating DNN watermarks.
We define DNN watermarking as the process of covertly embedding information in the DNN in order to verify and prove an owner’s ownership. We focus on black-box watermarking for image classification, which achieves the aforementioned goal by embedding special input-output patterns in DNNs. To embed the watermark, Alice will train a DNN with both the regular dataset and the trigger set with specific output labels. To detect the watermark, Alice will use a subset of the trigger set as the input to the DNN and observe the output. There will be a positive detection if certain probability requirements are met.
A successful watermarking method has to meet several criteria regarding its effectiveness, fidelity, false positive rate, and robustness. Detailed descriptions of the criteria are presented in Table 1. First, the effectiveness criterion states that a watermark has to ensure successful and consistent detection. Second, the fidelity criterion states that watermarking cannot have a significant negative impact on the regular functionality of the model. False positive rate and robustness are discussed separately in the following subsections.
2.2 Proof of Ownership
A watermarking method’s ability to prove ownership mainly relies on its low false positive rate. Suppose that Alice decides that a watermark detection is positive if there are at most $m$ misclassifications among $N$ trigger images. Then, assuming independence between the classifications of the individual trigger images, the probability of a positive detection can be calculated as follows [7][5]:
$$\Pr = \sum_{i=0}^{m} \binom{N}{i} (1-p)^{i} \, p^{N-i} \qquad (1)$$
where $p$ represents the accuracy of the model on the trigger set.
Ownership is established based on the fact that $\Pr$ is disproportionately small for a non-watermarked neural network. If a watermarking method incurs a high false positive rate (a high $p$ for non-watermarked models), then $\Pr$ is no longer small and the proof of ownership will be inconclusive at best.
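To make the argument concrete, the following minimal sketch evaluates the detection probability under the binomial model implied by the independence assumption above. The function names `detection_probability` and `is_positive_detection` and their parameters (`n_triggers`, `max_errors`, `accuracy`) are illustrative names that do not come from the paper, and the detection rule is only one plausible reading of "certain probability requirements".

```python
from math import comb


def detection_probability(n_triggers: int, max_errors: int, accuracy: float) -> float:
    """Probability of at most `max_errors` misclassifications among
    `n_triggers` independently classified trigger images, when each image
    receives the expected trigger label with probability `accuracy`."""
    error_rate = 1.0 - accuracy
    return sum(
        comb(n_triggers, errors)
        * error_rate ** errors
        * accuracy ** (n_triggers - errors)
        for errors in range(max_errors + 1)
    )


def is_positive_detection(predicted, expected, max_errors: int) -> bool:
    """Assumed detection rule: positive if the model disagrees with the
    expected trigger labels on at most `max_errors` trigger images."""
    errors = sum(1 for p, e in zip(predicted, expected) if p != e)
    return errors <= max_errors
```

As an illustration with made-up numbers: for 100 trigger images and a tolerance of 10 misclassifications, a model that produces the trigger labels only 10% of the time passes the test with a probability far below 10^-60, whereas a model with 99% trigger accuracy passes almost surely.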
2.3 Threat Model
In this subsection, we introduce our notion of robustness by defining our threat model. We assume that Bob has white-box access to the model, but does not have access to the training set. Instead, Bob has access to some proprietary test data (i.e., a subset of the test set). We argue that proprietary data is one of Alice’s most important competitive advantages, and an IP pirate such as Bob should by no means have access to it. Otherwise, with the training data and the model, he might as well train a new model on his own, especially when he has the ability to carry out sophisticated attacks such as fine-tuning. On the other hand, it is reasonable to assume that the attacker has white-box access to the model architecture and parameters. In the case of a cloud ML service, Bob can be a malicious service platform. In the case of software ML libraries, Bob can be a hacker. In both cases, Bob would have full white-box access to the model. We also assume that Alice only has black-box access to the model. In addition, Alice has direct access to the model’s input: there are no preprocessing stages between Alice’s input and the input of the model.
With some test data and the model, Bob may fine-tune the model to produce a slightly different version of it. That is called the fine-tune attack. After the fine-tune attack, Alice’s watermark should still exist. Some researchers also discussed overwrite attacks, where Bob tries to embed his own watermark using the same procedure on Alice’s model. It is indeed a very reasonable attack scenario. In our experiments, we found that embedding a new watermark using Bob’s limited amount of data would adversely affect the model’s performance, rendering the model much less valuable. Thus we rule out the possibility of Bob carrying out overwrite attacks.
2.4 Problem Definition
A DNN for classification is a function f. Given an input X, it is desired that the function classifies it correctly to its label y, f(X) = y.
A pattern P has the same dimensions as X, but is much sparser. In its image form, P’s non-zero entries can be considered as a set of K pixels with explicitly designed values and coordinates. P is tightly coupled with the identity of the model owner, and the absolute and relative coordinates of the pixels may or may not contribute to P’s ability to carry information. The pattern can be embedded in any input X from the intended data distribution to convert it into a trigger input through a function g(X, P). A watermarked DNN will be trained to classify g(X, P) to its re-assigned label. The fact that a DNN model classifies the trigger inputs with disproportionately high accuracy can serve as a unique proof of the owner’s identity.
We consider two alternative approaches to create trigger patterns. In the method proposed by Guo et al. (shown in Figure 2(c)), a color-coded key serves as the trigger pattern [7]. They embed the pattern by offsetting the pixel values of the input, g(X, P) = X + P. Since the information is mainly ingrained in the pixel values, we consider the pixel locations to be flexible. We use Key throughout the paper to denote this trigger pattern. The second approach we consider is proposed by Zhang et al. and shown in Figure 2(b) [6]. Here the information is contained in the geometrical shape of the logo, so the pixels have to remain in fixed positions relative to each other. Thus its location can be represented by its top left corner
. The authors did not explicitly state how they embed the logo, but we interpret it as blending with the input,
. We denote the second type of trigger pattern as Logo.
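The two embedding functions can be sketched as follows. The additive form g(X, P) = X + P is taken from the text. The blending formula for Logo is not legible here, so the convex combination with a hypothetical coefficient `alpha` is an assumption, as are the function names, the grayscale list-of-lists image representation, the (row, column, offset) layout of a Key pixel, and the clipping to [0, 1].

```python
from typing import List, Tuple

Image = List[List[float]]                  # grayscale image, values in [0, 1]
KeyPattern = List[Tuple[int, int, float]]  # sparse pattern: (row, col, offset)


def clip(value: float, low: float = 0.0, high: float = 1.0) -> float:
    """Keep a pixel value inside the valid range (an assumption of this sketch)."""
    return max(low, min(high, value))


def embed_key(image: Image, pattern: KeyPattern) -> Image:
    """Key-style embedding: offset the K listed pixels, g(X, P) = X + P."""
    out = [row[:] for row in image]  # copy so the input image is untouched
    for row, col, offset in pattern:
        out[row][col] = clip(out[row][col] + offset)
    return out


def embed_logo(image: Image, logo: Image, top: int, left: int, alpha: float) -> Image:
    """Logo-style embedding, assumed to blend the logo into the image with
    weight `alpha`, anchored at the logo's top left corner (top, left)."""
    out = [row[:] for row in image]
    for i, logo_row in enumerate(logo):
        for j, logo_value in enumerate(logo_row):
            r, c = top + i, left + j
            out[r][c] = (1.0 - alpha) * out[r][c] + alpha * logo_value
    return out
```

In this sketch a Key pattern is free to place each pixel anywhere, while a Logo moves as a rigid block described only by its anchor, which mirrors the distinction drawn above.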
Our main goal is to find the pattern P such that the probability of a non-watermarked DNN classifying a trigger input to its original label is maximized. The main motivation behind this goal is to minimize false positive watermark detections. Empirically, given a dataset D, the goal can be expressed as follows.
We have found that a larger v leads to more robust watermarking, although it also leads to more false positives. In the Key-related experiments, v is given. But we can also integrate v into the optimization landscape as follows. The Logo-related experiments use this objective function.
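The two objectives can be illustrated with a minimal sketch. It assumes that the first objective is the empirical accuracy of a non-watermarked classifier on trigger inputs measured against their original labels, and that the second adds a weighted term rewarding a stronger pattern. The additive form, the function names, and the parameters `strength` and `weight` are assumptions for illustration and are not the paper's exact equations.

```python
from typing import Callable, Sequence, Tuple, TypeVar

X = TypeVar("X")  # input type, e.g. an image
P = TypeVar("P")  # pattern type


def accuracy_fitness(
    classify: Callable[[X], int],
    embed: Callable[[X, P], X],
    dataset: Sequence[Tuple[X, int]],
    pattern: P,
) -> float:
    """Fraction of trigger inputs that a non-watermarked classifier still
    assigns to their original labels (higher means fewer false positives)."""
    hits = sum(1 for x, label in dataset if classify(embed(x, pattern)) == label)
    return hits / len(dataset)


def weighted_fitness(
    classify: Callable[[X], int],
    embed: Callable[[X, P], X],
    dataset: Sequence[Tuple[X, int]],
    pattern: P,
    strength: float,
    weight: float,
) -> float:
    """Assumed combined objective: accuracy plus a weighted reward for a
    stronger (and therefore more robust) pattern."""
    return accuracy_fitness(classify, embed, dataset, pattern) + weight * strength
```

The `weight` controls how much a stronger pattern may cost in accuracy before the candidate stops being preferred, which is the tradeoff between robustness and false positives described above.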
In this section, we first provide a high-level overview of why we chose the DE framework and how it works. Then we delve deeper into the algorithmic details that are crucial to the convergence of DE.
3.1 Differential Evolution
To find the pattern P, the first methods that came to mind were the gradient-based methods commonly used for finding adversarial samples [10][11][12]. A key difference between our problem and theirs is that our pattern P is universal. Therefore, finding the gradient for individual inputs hardly helps in our situation. The family of evolutionary algorithms is among the most prominent non-gradient-based optimization methods. We initially relied on a generic evolutionary algorithm (EA), but we were unable to find a reasonable set of parameters to make the algorithm converge. That is when differential evolution (DE) presented itself as an alternative.
DE is a metaheuristic search algorithm that optimizes a given objective by evolving a population of candidates in parallel [13]. It follows the concept of generic EAs, where a population of candidates evolves and the fittest candidates survive in each iteration. However, DE is simpler and is known to facilitate faster convergence to the global optimum. Instead of using mutation and crossover between two parents, candidates in DE evolve over a triplet: a new candidate is created by adding a weighted difference between two candidates to the third.
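The triplet-based variation can be sketched as below. This is a minimal illustration of the general DE idea on real-valued vectors, not the paper's Algorithm 1: the function names, the greedy parent-versus-child selection, and the choice to sample the triplet uniformly from the other candidates are assumptions, and the crossover step of classic DE is omitted for brevity.

```python
import random
from typing import Callable, List

Candidate = List[float]


def differential_variation(
    base: Candidate, a: Candidate, b: Candidate, weight: float
) -> Candidate:
    """New candidate = base + weight * (a - b), computed element-wise."""
    return [x + weight * (p - q) for x, p, q in zip(base, a, b)]


def de_generation(
    population: List[Candidate],
    fitness: Callable[[Candidate], float],
    weight: float,
    rng: random.Random,
) -> List[Candidate]:
    """One generation: each parent competes against a child built from a
    random triplet of other candidates; the fitter of the two survives."""
    next_population = []
    for index, parent in enumerate(population):
        others = [c for i, c in enumerate(population) if i != index]
        base, a, b = rng.sample(others, 3)  # needs at least 4 candidates
        child = differential_variation(base, a, b, weight)
        next_population.append(child if fitness(child) >= fitness(parent) else parent)
    return next_population
```

Because a parent is only replaced by a child that is at least as fit, the best fitness in the population never decreases from one generation to the next.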
Algorithm 1 presents the high-level procedure of using DE to solve our problem. The main idea is to generate (1) new pixel coordinates and (2) new pixel values of P using the differential variation operation described in Equation 4. The FITNESS function can be either of the two objective functions described in Section 2.4. The EVOLVE function, on the other hand, is more complex. We describe it in more detail in the next subsection.
3.2 Optimizations for DE
We use two different variants of the EVOLVE function for the two existing trigger patterns, Key and Logo. For Logo, the EVOLVE function is straightforward. We use the top left pixel of the logo as the anchor, and each candidate can be represented by a simple triplet. We use DE to evolve and select the location of the logo as well as its pixel values. We use a different approach for Key. Since pixel locations are all flexible, the candidate will be an array of K tuples. When we evolve using three candidates, each with K pixels, which pixels should pair up and evolve becomes an important question. If pixels are randomly paired up, they are likely to engage in a Brownian motion-like movement across generations. Consequently, as we show empirically later, the evolution will not converge. If our goal is to evolve the pixels into optimal locations, then it makes sense to induce the evolution in such a way that pixels nearest to an optimal location will move toward that location. To that end, we propose an algorithm that pairs the closest pixels together to evolve. The most efficient implementation is to store all pairwise distances in heaps and always pair available pixels with the smallest distances. Algorithm 2 describes the implementation in more detail. The time complexity of the algorithm is , where K is the number of pixels.
Algorithm 2 Evolve with Closest Triplet
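The pairing idea can be illustrated with the following sketch. It is not a reproduction of Algorithm 2: it assumes a greedy matching in which all pairwise squared distances between the pixels of two candidates are placed in a heap and the closest still-available pair is matched first. The function names and the rule `base + weight * (a - b)` applied to matched pixels are assumptions, and all three candidates are assumed to hold the same number of pixels.

```python
import heapq
from typing import List, Tuple

Pixel = Tuple[float, float]  # (x, y) coordinates of one pattern pixel


def closest_pairs(first: List[Pixel], second: List[Pixel]) -> List[int]:
    """Greedily match every pixel in `first` to a distinct pixel in `second`,
    smallest distance first. Returns `match`, where `match[i]` is the index
    in `second` paired with `first[i]`."""
    heap = [
        ((ax - bx) ** 2 + (ay - by) ** 2, i, j)
        for i, (ax, ay) in enumerate(first)
        for j, (bx, by) in enumerate(second)
    ]
    heapq.heapify(heap)
    match = [-1] * len(first)
    used = set()
    while heap and len(used) < len(first):
        _, i, j = heapq.heappop(heap)
        if match[i] == -1 and j not in used:
            match[i] = j
            used.add(j)
    return match


def evolve_closest_triplet(
    base: List[Pixel], a: List[Pixel], b: List[Pixel], weight: float
) -> List[Pixel]:
    """Move each base pixel by the weighted difference of its nearest
    matched pixels in the two other candidates."""
    to_a = closest_pairs(base, a)
    to_b = closest_pairs(base, b)
    child = []
    for i, (x, y) in enumerate(base):
        ax, ay = a[to_a[i]]
        bx, by = b[to_b[i]]
        child.append((x + weight * (ax - bx), y + weight * (ay - by)))
    return child
```

In this sketch the heap holds K^2 entries, so a matching costs O(K^2 log K) in the worst case. A real implementation would also round and clip the resulting coordinates to the image bounds.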
Figure 3: Best fitness of the population across the generations.
Figure 3 shows the fitness of the best candidate over the generations. The FITNESS function is simply the accuracy on the subset. The proposed closest-triplet evolve function converged much faster than the evolve function in which pixels are randomly paired together. In fact, in the latter case, the fitness plateaus at around 0.89, and it is unclear whether it will converge at all.
In this section, we report the performance evaluations of our method. We first describe the implementation details of the watermarking procedure and the DE algorithm. Then we evaluate the effectiveness, fidelity, false positive rate, and robustness of the watermark in the following subsections. Since neither Logo nor Key is an original idea from this paper, we omit many repetitive experiments for brevity. The key is to demonstrate the ability of our DE algorithm to reduce false positives while maintaining the robustness of the watermark. It is worth noting that all of our trigger sets are built from test data that has not been used during training.
4.1 Differential Evolution
Figure 4: Output of our DE algorithm. (a) The Key trigger pattern on CIFAR-10, (b) the Logo trigger pattern on CIFAR-10, (c) the Key trigger pattern on MNIST, (d) a sample using the trigger pattern in (a), (e) a sample using the trigger pattern in (b), (f) a sample using the trigger pattern in (c).
We applied our DE-based approach on both Logo and Key trigger pattern generation. We use Equation 3 as the fitness function for Logo, where both pixel locations and values are optimized. The weight is set such that a maximum v constitutes 5% of the fitness score. With the Key pattern, we only optimize the pixel locations. In the fitness functions of both DE algorithms, we evaluate the accuracy of candidates on a random set of 640 training images. The model and dataset are described in the next subsection.
We carried out experiments on both the MNIST and CIFAR-10 datasets. The trigger patterns and trigger set images output by the DE algorithm are shown in Figure 4. To create Logo patterns on the CIFAR-10 dataset, we replicated the experiments by Zhang et al. [6]. Surprisingly, the optimal location to put the logo is not at one of the four corners, as one would intuitively think. In DE, we searched for the blending coefficient instead, which yielded an optimal value of 0.4019.
Table 2: Classification accuracy of watermarked models on test sets and trigger sets. The trigger sets are derived from the test sets using the Key / Logo method.
Like Guo et al. [7], we encoded the message in the pixel values of the Key trigger pattern. The CIFAR-10 variant includes 64 pixels and encodes 128 bits of information. Every pixel in the RGB color space with encodes 2 bits, and the message can be decoded by reading the pixels from left to right, top to bottom. The MNIST variant includes 192 pixels and encodes 192-bit information, with
to encode 0 and 1 respectively. In both cases, pixels in the Key pattern gravitate towards the edges of the images. Clearly, the DE algorithm rewards pixel locations that do not overlap with the objects, which tend to occupy the center of the image. The new pixel locations form patterns in a way that has minimal impact on the classification of an object. Because of that, even though we added patterns with large pixel values, the resulting images still did not trigger regular models.
To test the capacity of our algorithm, we intentionally used patterns with much more complexity in our experiments on the MNIST dataset. Images in the MNIST dataset have much more empty space to take advantage of, while pixels blindly accumulating at the edges may cause misclassification. Through our algorithm, the probability increased from as low as 83.30% (during random initialization) to 99.27%. The probability is measured over the entire trigger set, and the classification accuracy on the regular test set with the exact same images is 99.46%. It clearly shows the algorithm’s ability to learn to reduce false positive detections of the watermark. To test our DE algorithm’s ability to converge, we repeated the MNIST/CIFAR-10 Key experiments on 8 different parameter sets (number of pixels in the pattern, etc.), with 5 experiments per set. Each set converged to solutions that produce extremely similar fitness scores, with an average standard deviation of 0.0060.
4.2 Effectiveness and Fidelity
It has been demonstrated in all previous works that DNNs can be trained to successfully recognize the trigger sets. In addition, Adi et al. showed the importance of training from scratch in creating a robust watermark [4]. We followed the same procedure. Table 2 shows the classification accuracy of both non-watermarked and watermarked models on the regular test set and the trigger set.
In light of the fidelity criterion, the classification accuracy of the watermarked model on the regular test set is slightly lower than that of the regular non-watermarked model. This is expected, as watermarking makes the classification problem much harder. In light of the effectiveness criterion, the ability of the watermarked model to recognize the trigger set is as good as its ability to classify regular images.
4.3 False Positives
Figure 5: False positive rates of different CIFAR-10 trigger sets. It is the probability of a non-watermarked DNN getting falsely triggered. The lower the false positive rate the better.
Figure 5 shows the false positive rate of different trigger patterns. The false positive rate here is measured by the probability that a non-watermarked model classifies a trigger image into its re-assigned class. We used four different non-watermarked DNNs trained on the regular CIFAR-10 dataset: ResNet-18, ResNet-50, DenseNet-121, and VGG-16. The fitness function in DE used to obtain the trigger pattern only involves the ResNet-18 model. The results show that what our DE learns from one model generalizes well to other models. We tested the generalizability further using the same pattern on 5 newly trained VGG-13s, and obtained a 95% confidence interval of
.
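The measurement described above can be sketched as follows. How the confidence interval was computed is not stated, so the Student t interval over per-model rates is an assumption, as are the function names. The critical values are standard two-sided 95% t quantiles, keyed here by sample size.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% Student t critical values, keyed by sample size n (df = n - 1).
T_CRITICAL_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776, 6: 2.571}


def false_positive_rate(predictions: Sequence[int], reassigned: Sequence[int]) -> float:
    """Fraction of trigger images that a non-watermarked model happens to
    classify into their re-assigned class."""
    hits = sum(1 for p, r in zip(predictions, reassigned) if p == r)
    return hits / len(reassigned)


def confidence_interval_95(rates: Sequence[float]) -> Tuple[float, float]:
    """Assumed t-based 95% interval for the mean rate across models."""
    n = len(rates)
    mean = statistics.mean(rates)
    half_width = T_CRITICAL_95[n] * statistics.stdev(rates) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

With five independently trained models, `confidence_interval_95` takes the five per-model false positive rates and returns the lower and upper bound of the interval around their mean.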
We used two baselines for comparison. To compare with the DE-based Key pattern, we used a Key pattern with random but the same
. We see drastic improvements, with up to a 10× reduction in the false positive rate. Note that the trigger pattern proposed by Guo et al. is also based on random locations [7]. But they explicitly selected
such that the pattern is imperceptible, resulting in a lower false positive rate. But as we see later, they achieved that at the cost of robustness. To compare with the DE-based Logo, we use the Logo trigger pattern used by Zhang et al. [6]. We achieve about a 2× improvement in the false positive rate. Putting it in the perspective of Equation 1, even 2×
translates to over
lower probability (
).
4.4 Robustness
We measure the robustness of the watermarking methods through their resistance against fine-tune attacks. Table 3 reports the trigger set classification accuracy loss after we fine-tuned a watermarked model. Unlike some of the earlier approaches that based their attack on the training set, we used 1000 test images and applied various data augmentation techniques. The model watermarked using the original Key method suffered a significant drop in accuracy. The accuracy drop was almost entirely eliminated when we switched to our Key method. Both Logo methods were resilient against the fine-tune attack.
Table 3: Classification accuracy of watermarked models on corresponding CIFAR-10 trigger sets after fine-tune attacks. The smaller the accuracy loss, the more robust the method is.
Images superimposed with our pattern are sufficiently different from the normal input distribution. Because of that, a watermarked model’s ability to recognize those patterns is largely orthogonal to its ability to classify objects and is, therefore, harder to remove during a fine-tune attack.
4.5 Discussions
Gradient-free methods do exist in the world of adversarial learning, most notably in the subproblem of black-box attacks, where the attackers don’t have access to the gradient information [14][15][16][17]. Those methods again focus on individual samples and are essentially solving a different problem than ours. It is worth noting that the work from Moosavi-Dezfooli et al. aims at creating universal adversarial perturbations [18]. Their proposal to reduce the search space to a subset of the input provided invaluable insights.
Due to the limited scope of this paper, the parameter v is not systematically studied. It is more of a heuristic and is manually selected in many situations. It would be valuable to study systematically how it affects the robustness of the watermarking methods.
Black-box DNN watermarking has emerged as a viable solution to IP protection in the context of MLaaS. Adding owner identity-based trigger patterns to natural input images is a popular method to create effective trigger sets that establish strong ownership proofs. In this paper, we propose a novel differential evolution-based framework to optimize the generation of such trigger patterns. Compared to the prior art, our method demonstrates significant improvement in false positive rate and robustness in experiments on popular models and datasets.
[1] L. Chapman, “VCs plowed a record $9.3 billion into AI startups last year.” https://www.bloomberg.com/news/articles/2019-01-08/vcs-plowed-a-record-9-3-billion-into-ai-startups-last-year, 2019.
[2] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh, “Embedding watermarks into deep neural networks,” in Proceedings of ACM International Conference on Multimedia Retrieval, pp. 269–277, 2017.
[3] E. L. Merrer, P. Perez, and G. Trédan, “Adversarial frontier stitching for remote neural network watermarking,” CoRR, vol. abs/1711.01894, 2017.
[4] Y. Adi, C. Baum, M. Cissé, B. Pinkas, and J. Keshet, “Turning your weakness into a strength: Watermarking deep neural networks by backdooring,” in 27th USENIX Security Symposium, pp. 1615–1631, 2018.
[5] B. D. Rouhani, H. Chen, and F. Koushanfar, “DeepSigns: A generic watermarking framework for IP protection of deep learning models,” CoRR, vol. abs/1804.00750, 2018.
[6] J. Zhang, Z. Gu, J. Jang, H. Wu, M. P. Stoecklin, H. Huang, and I. Molloy, “Protecting intellectual property of deep neural networks with watermarking,” in Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 159–172, 2018.
[7] J. Guo and M. Potkonjak, “Watermarking deep neural networks for embedded systems,” in Proceedings of the International Conference on Computer-Aided Design, p. 133, 2018.
[8] C. S. Collberg and C. D. Thomborson, “Software watermarking: Models and dynamic embeddings,” in Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 311–324, 1999.
[9] A. B. Kahng, J. Lach, W. H. Mangione-Smith, S. Mantik, I. L. Markov, M. Potkonjak, P. Tucker, H. Wang, and G. Wolfe, “Watermarking techniques for intellectual property protection,” in Proceedings of the 35th Conference on Design Automation, DAC, pp. 776–781, 1998.
[10] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” CoRR, vol. abs/1412.6572, 2014.
[11] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, “The limitations of deep learning in adversarial settings,” in IEEE European Symposium on Security and Privacy, pp. 372–387, 2016.
[12] N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy, pp. 39–57, 2017.
[13] R. Storn and K. V. Price, “Differential evolution - A simple and efficient heuristic for global optimization over continuous spaces,” J. Global Optimization, vol. 11, no. 4, pp. 341–359, 1997.
[14] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436, 2015.
[15] J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” CoRR, vol. abs/1710.08864, 2017.
[16] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” in Proceedings of the 35th International Conference on Machine Learning, pp. 2142–2151, 2018.
[17] M. Alzantot, Y. Sharma, S. Chakraborty, and M. B. Srivastava, “GenAttack: Practical black-box attacks with gradient-free optimization,” CoRR, vol. abs/1805.11090, 2018.
[18] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, pp. 86–94, 2017.