Copy Move Source-Target Disambiguation through Multi-Branch CNNs

We propose a method to identify the source and target regions of a copy-move forgery so as to allow a correct localisation of the tampered area. First, we cast the problem into a hypothesis testing framework whose goal is to decide which of the two nearly-duplicate regions detected by a generic copy-move detector is the original one. Then we design a multi-branch CNN architecture that solves the hypothesis testing problem by learning a set of features capable of revealing the presence of interpolation artefacts and boundary inconsistencies in the copy-moved area. The proposed architecture, trained on a synthetic dataset explicitly built for this purpose, achieves good results on copy-move forgeries from both synthetic and realistic datasets. Based on our tests, the proposed disambiguation method can reliably reveal the target region even in realistic cases where an approximate version of the copy-move localization mask is provided by a state-of-the-art copy-move detection algorithm.

Index Terms—Copy-move detection and localization, image forensics, tampering detection and localization, deep learning for forensics, Siamese networks.

Thanks to the wide availability of easy-to-use image editing tools, altering the visual content of digital images is becoming simpler and simpler. Copy-Move (CM) forgery, where an image region is copied into another part of the same image, is one of the most common and easy-to-implement forms of image tampering. To detect this kind of forgery, several CM detection and localization algorithms have been proposed, attempting to determine whether a given image contains cloned regions, or so-called nearly duplicate regions (in which case, the image is labeled as a suspect or forged image). The great majority of the algorithms proposed so far rely on local hand-crafted features [1], [2], and are grouped into two main categories: block-based (also called patch-based) methods, e.g. [3], [4], and keypoint-based methods [5], [6], [7]. Both approaches have their strengths and weaknesses, and a solution capable of outperforming all the others in every working condition is not available yet. Motivated by the recent trend towards the adoption of Deep Learning (DL) methods for image forensic tasks, DL-based approaches have also been proposed for CM detection. Such methods are capable of automatically learning and extracting descriptors from the image, e.g. in [8], [9], by means of Deep Neural Networks (DNNs), which hence work as feature extractors. End-to-end DNN-based solutions for copy-move tampering localization have also been proposed, as in [10], [11], where a convolutional and a de-convolutional module work together to directly produce a copy-move forgery mask from the to-be-analyzed input image.

Mauro Barni and Benedetta Tondi are with the Department of Information Engineering and Mathematics, University of Siena, 53100 Siena, ITALY; Quoc-Tin Phan is with the Department of Information Engineering and Computer Science, University of Trento, 38123 Trento, ITALY.

The great majority of the algorithms proposed so far can only detect the copy-move forgery and localize the nearly duplicate areas, providing a binary mask that highlights both the source region and its displaced version, without identifying which of the two regions corresponds to the source area and which to the target one. However, only the target region of a copy-move forgery corresponds to a manipulated area; therefore, distinguishing between source and target regions is of primary importance to correctly localize the tampered area and possibly trace back to the goal of the forgery. To the best of our knowledge, the only paper addressing the problem of source-target disambiguation in copy-move forgeries is [11]. In that work, an end-to-end system for CM localization and disambiguation, called BusterNet, is proposed, based on a DNN architecture with two branches. The first branch is designed to extract a pool of features revealing general traces of manipulations. These features are then combined with those extracted from the other branch, which is in charge of copy-move detection. With regard to the CM forgery detection (CMFD) performance, the method in [11] provides good results in many cases. With regard to source-target disambiguation, however, the performance achieved on realistic publicly available CM datasets is rather limited. As stated by the authors themselves, this may be due to the limited performance of the manipulation detection branch, which tends to overfit to the synthetic dataset used for training.

In this paper, we propose a new DNN-based method to address the problem of source-target disambiguation in images subject to CM manipulation. Given the binary localization mask produced by a generic copy-move detector, our method makes it possible to derive the actual tampering mask by identifying the target and source regions of the copy-move. The main idea behind the proposed method is to exploit the non-invertibility of the copy-move transformation, due to the presence of interpolation artefacts and local post-processing traces in the displaced region. Specifically, we propose a multi-branch CNN architecture, called DisTool, consisting of two main parallel branches, looking for two different kinds of CM traces. The first branch, named 4-Twins Net, consists of two parallel Siamese networks, trained in such a way as to exploit the non-invertibility of the copy-move process caused by the interpolation artefacts often associated with the copy-move operation. The second branch is a Siamese network [12] designed to identify artefacts and inconsistencies present at the boundary of the copy-moved region. The soft outputs of the two branches are finally fused through a simple fusion module. A remarkable strength of the proposed method is that it works independently of the CM detection algorithm, and hence it can be used on top of any such method. Our experiments show that the proposed method has very good disambiguation capabilities, greatly outperforming those of [11], and that it generalizes well to both synthetic and realistic copy-move forgeries from several different datasets. Robustness to post-processing is also good.

The paper is organized as follows. In Section II, we formalize the CM source-target disambiguation problem addressed in the paper, and present the rationale behind the proposed method. Section III describes the approach followed for estimating the CM geometric transformation between the source and target regions. The details of the multi-branch CNNs composing the system are given in Section IV. In Section V, we describe the methodology we followed to run the experiments whereby we validated the effectiveness of the proposed method. The results of the experiments are reported and discussed in Section VI. The paper ends in Section VII, with some concluding remarks.

In this section, we provide a rigorous formulation of the source-target disambiguation problem and present the overall architecture of the proposed system. Before doing that, we introduce some basic concepts and notation, and detail the main steps involved in the creation of a copy-move forgery.

Among the various instances of copy-move forgeries that can be encountered in practice, in this work we focus on the most common, and simplest, case of a single source region copy-moved into a single target location (referred to as (1-1) CM). The case of n sources individually copied into n target locations, namely the (n-n) case, can be interpreted as multiple instances of the (1-1) case and can be treated as such. More complicated copy-move forgeries, e.g., multi-target copy-moves (the (n-m) case, with m > n), must be treated differently, for instance by performing a preliminary analysis in order to trace back this case to the solution of several (1-1) problems, and are left for future investigation. When the target region partially overlaps the source, only the non-overlapping parts of the copied and pasted regions are regarded as a copy-move forgery.


Fig. 1: A CM forged image (a); corresponding binary localization mask (b); tampering map highlighting the source (green) and target (red) regions (c); final (desired) tampering map (d).

A. Preliminaries

Let I be the original image of size $l \times m$, and $I_f$ the copy-move forgery, of the same size. We denote with S and T the subparts of $I_f$ corresponding to the source and the target region, respectively.


The source region S is copied, possibly altered geometrically, and pasted into T. In its most basic form, the copy-move operation can be modeled as a geometric affine transformation between S and T. Let $H_\theta$ denote the transformation that maps a generic point $(u, v)$ in the source region S into another point $(u', v')$ in T, parameterized by a vector $\theta$. Such a transformation can be represented in homogeneous coordinates by a matrix $H_\theta$ as follows:
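
For an affine map, a standard way to write this relation in homogeneous coordinates is sketched below. The entry names $h_{ij}$ are chosen here for illustration and are an assumption, not notation taken from the rest of the paper.

```latex
\[
\begin{bmatrix} u' \\ v' \\ 1 \end{bmatrix}
= H_\theta \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= \begin{bmatrix}
    h_{11} & h_{12} & h_{13} \\
    h_{21} & h_{22} & h_{23} \\
    0      & 0      & 1
  \end{bmatrix}
  \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
\tag{1}
\]
```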


The transformation matrix $H_\theta$ can represent a rotation, resizing or scaling, shearing, translation or, more generally, a composition of them (translation is also covered thanks to the use of homogeneous coordinates). Then, ideally, for every $(u', v')$, we would have $I_f(u', v') = I(u, v)$, where the relation between $(u, v)$ and $(u', v')$ is established by the matrix $H_\theta$. In general, after the transformation, the mapped point is not a valid point in the 2D regular pixel grid, that is, $u'$ and $v'$ in (1) are not integers. The pixel values at the regular grid points are therefore obtained by interpolating the neighboring pixels of the source region by means of a kernel function $k(\cdot, \cdot)$. For simplicity, we let $\Psi_{\theta,k}$ denote the transformation that maps the pixels in the source region S to those in the target region T, taking into account both the geometric transformation $H_\theta$ and the interpolation process with kernel k. Then, $T = \Psi_{\theta,k}(S)$. The interpolation process introduces correlations among neighboring pixels in T. After interpolation, S and T are nearly duplicate regions (the regions are not identical because of the interpolation). In most cases, the interpolation process makes the copy-move operation non-invertible.
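
The loss of invertibility can be illustrated with a minimal sketch. It assumes a pure horizontal translation (a special case of the affine model) and a bilinear kernel; the function names and the toy texture are hypothetical. A shift by an integer number of pixels is undone exactly, whereas a half-pixel shift followed by its inverse leaves a smoothed version of the original.

```python
import math

def bilinear(img, x, y):
    """Sample img (a list of rows) at real-valued (x, y); zero outside the grid."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0

    def px(row, col):
        return img[row][col] if 0 <= row < h and 0 <= col < w else 0.0

    return ((1 - ax) * (1 - ay) * px(y0, x0) + ax * (1 - ay) * px(y0, x0 + 1)
            + (1 - ax) * ay * px(y0 + 1, x0) + ax * ay * px(y0 + 1, x0 + 1))

def shift(img, tx):
    """Translate img horizontally by tx pixels, resampling with the bilinear kernel."""
    h, w = len(img), len(img[0])
    # Output pixel (u, v) takes the value of the input at (u - tx, v).
    return [[bilinear(img, u - tx, v) for u in range(w)] for v in range(h)]

def round_trip_error(img, tx, margin=2):
    """Max abs difference between img and its forward-then-backward copy, away from borders."""
    back = shift(shift(img, tx), -tx)
    w = len(img[0])
    return max(abs(a - b)
               for row_a, row_b in zip(img, back)
               for a, b in zip(row_a[margin:w - margin], row_b[margin:w - margin]))

# Small synthetic texture (arbitrary values, for illustration only).
image = [[float((7 * i + 13 * j) % 5) for j in range(8)] for i in range(8)]
err_integer = round_trip_error(image, 2.0)   # integer shift: no interpolation
err_subpixel = round_trip_error(image, 0.5)  # half-pixel shift: double interpolation
```

Here `err_integer` is zero while `err_subpixel` is not, which mirrors the distinction between translations by an integer number of pixels and transformations that require resampling.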

In realistic copy-move forgeries, various post-processing operations might also be applied locally to the target region in order to hide the traces of copy-pasting. For example, the pasted region and the background are often blended to visually hide the transition between the copied part and the surrounding area. Post-processing might also be applied globally, in which case it affects both the source and the target regions.

An example of a CM forged image is provided in Fig. 1(a), along with the corresponding localization mask (b). The disambiguation map is provided in Fig. 1(c), where the same color labelling convention of [11] is followed, with the green channel corresponding to the source mask, the red channel to the target mask and the blue channel to the background mask. The final binary tampering mask for the image, where only the target region is highlighted (corresponding to the tampered part), is reported in Fig. 1(d).

B. Problem formulation and rationale of the proposed solution

As we said, our goal is to devise a method for source-target disambiguation that exploits the non-invertibility of the copy-move process caused by interpolation. To improve the effectiveness of the algorithm, we also exploit the possible presence of boundary artifacts in the target region (e.g. those due to blending), which are not present in the source. In fact, even if copy-move tampering is carried out properly, subtle boundary artifacts and edge inconsistencies are often present and can be exploited for the disambiguation task.

The general scheme of the architecture we designed to solve the CM disambiguation problem is provided in Fig. 2. The inputs to the system are the forged image $I_f$ and the localization mask consisting of two separate regions. Note that we refer to the case of spatially separated regions for the sake of simplicity; however, the analysis is still valid for contiguous regions, assuming that the CM detection algorithm outputs two distinct regions.


Fig. 2: Scheme of the proposed CM disambiguation system.

Let us first focus on the upper branch of Fig. 2. Given a pair of nearly duplicate regions, our first approach to disambiguate the source and target regions relies on the following observation: if one tries to replicate the copy-move process starting from the source region, i.e. in the forward direction, it is ideally possible to re-obtain exactly the target region (in practice, the parameters of the transformation bringing S into T are not known exactly, so we will only obtain a very good approximation of T). On the other hand, if one tries to mimic a copy-move process starting from the target region, i.e. in the backward direction, an exact copy (or even a good approximation) of the source region cannot be obtained, due to the non-invertibility of the copy-move process. In other words, when the target region T is moved onto the source S, the approximated source region differs significantly from S because it has undergone a double interpolation (from source to target, and then from target to source again), while no interpolation artifacts are present in the source region itself. The approximation is therefore less close than in the opposite case, where both the target and the approximation of the target are subject to a similar (ideally the same) interpolation procedure.

Based on this idea, starting from the two regions and their approximated versions, the problem of disambiguating the source and target regions can be formulated as the following composite hypothesis test. Let $P_1$ and $P_2$ denote the two nearly duplicate regions resulting from the binary localization map provided by the copy-move detector. Then, the composite hypothesis test we have to solve must decide between the following cases:

$H_0$: $P_2 \approx \Psi_{\theta_0,k_0}(P_1)$, i.e., $P_1 \equiv S$ (and $P_2 \equiv T$).

$H_1$: $P_1 \approx \Psi_{\theta_1,k_1}(P_2)$, i.e., $P_2 \equiv S$ (and $P_1 \equiv T$),

where $\theta_0$ and $\theta_1$ are the parameters of the transformation bringing $P_1$ into $P_2$ and vice versa, and $k_0$ and $k_1$ are the interpolation kernel parameters. When hypothesis $H_0$ holds, $\Psi_{\theta_0,k_0}$ corresponds to the transformation applied during the copy-move process, for some unknown parameter vector $\theta_0$ of the geometric transformation $H_{\theta_0}$, and kernel $k_0$ of the interpolation.

To test the two hypotheses, we need to consider the transformation that moves $P_1$ to $P_2$, and vice versa (i.e., the transformation that moves $P_2$ to $P_1$), and try to guess which of the two is the forward direction. Therefore, as depicted in Fig. 2, we should first estimate the parameters of the transformation under both hypotheses and then choose the direction for which the approximation obtained by means of the estimated transformation is the best one. Formally, this is equivalent to solving the following generalized likelihood ratio test (GLRT):
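
A generic GLRT of this kind can be sketched as follows. The conditional-likelihood notation $\Pr\{\cdot \mid \cdot\}$ and the unit threshold are assumptions made here for illustration, and the exact expression of the test may differ.

```latex
\[
\frac{\max_{\theta_0, k_0} \Pr\{P_2 \mid \Psi_{\theta_0,k_0}(P_1)\}}
     {\max_{\theta_1, k_1} \Pr\{P_1 \mid \Psi_{\theta_1,k_1}(P_2)\}}
\;\underset{H_1}{\overset{H_0}{\gtrless}}\; 1
\]
```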


For simplicity, the effects at the borders of the target region, due to possible local post-processing, are not taken into account in the above formulation.

Since the interpolation method adopted for the copy-move is unknown, strictly speaking, it should be estimated. However, in our practical implementation, we have assumed that bilinear interpolation is used, hence $k_0 = k_1 = k$, where k is the bilinear kernel. Then, we only estimate the parameters of the geometric transformations, that is, $\theta_0$ and $\theta_1$.

When the interpolation artifacts are weak or not present at all, e.g., when the copy-move consists of a rigid translation of an integer number of pixels, we have $P_2 \approx \Psi_{\theta_0,k}(P_1)$ and $P_1 \approx \Psi_{\theta_1,k}(P_2)$, with $H_{\theta_1} = H_{\theta_0}^{-1}$ (the kernel k is close to a delta function), and then we cannot make a reliable decision based on the test in (2). The bottom branch of the scheme in Fig. 2 is introduced to cope with these cases. This branch exploits the possible presence of artifacts along the boundaries of $P_1$ and $P_2$. For the target region, in fact, boundary artifacts are likely to be present, given that the inner and outer parts of T come from different parts of I. These artifacts are not expected to be present across the boundary of S. Therefore, the presence of such artifacts or other inconsistencies along the boundary of one of the two regions $P_1$ and $P_2$ can be exploited to decide which of them corresponds to S and which to T. A further motivation for including a branch dedicated to artifacts along region boundaries is that the interpolation traces could be partially erased when strong post-processing is applied globally, thus making it difficult to solve the disambiguation problem via the composite test formalized above. In these cases, the analysis of boundary inconsistencies can be useful.

Finally, the results of the analyses of interpolation and boundary artifacts are fused (last block in Fig. 2).


Fig. 3: Illustrations of rotation angle and scaling factor estimation.

Copy-move detection methods provide a binary localization mask highlighting the regions affected by the copy-move, yet only some of them provide an estimate of the geometric transformation mapping one region into the other. Therefore, in the first step of the disambiguation chain, we estimate the geometric transformation bringing $P_1$ into $P_2$. The estimation can of course be skipped if the copy-move detector already provides such an estimate (see for instance the keypoint-based detector in [13], where an estimate of the transformation is provided via the RANSAC algorithm [14]). Hereafter, we describe the method adopted in our system to estimate the parameters of the affine geometric transformation moving one region into the other. With a slight abuse of notation, in this section, the regions $P_1$ and $P_2$ are regarded as the sets of coordinates of the pixels in the $x$-$y$ image plane, and not as the values of the pixels belonging to the regions. Given the binary localization mask, our goal is to estimate the homography matrix (having the form in (1)) that maps the pixels in $P_1$ into those of $P_2$, to obtain a remapped region $\tilde{P}_2$ that is as similar as possible to $P_2$ (and vice versa for the backward transformation). In its most general form, $H_\theta$ is an affine geometric transformation. In this paper, for simplicity, we consider only similarity transformations, namely translations, resizing or scaling, rotations and, more generally, a composition of them. In this case, the transformation in equation (1) can always be expressed as the subsequent application of a rotation with angle $\alpha$, a scaling with factors $f_x$, $f_y$, and two translations $t_x$, $t_y$, that is
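
One factorization consistent with this description, with the rotation applied first, then the scaling, then the translation, is sketched below. The sign convention of the rotation is an assumption.

```latex
\[
H_\theta =
\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} f_x & 0 & 0 \\ 0 & f_y & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
```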


With the above ideas in mind, the estimation of the geometric transformation between the candidate source and target regions is carried out according to the following three steps: i) estimation of the rotation angle $\alpha$, ii) estimation of the resizing factors $f_x$, $f_y$, and iii) estimation of the translations $t_x$ and $t_y$.

To estimate $\alpha$, we find the central principal inertia axes of $P_1$ and $P_2$ and let $\alpha$ be equal to the angular difference between them. Note that such axes are well defined for most regions, since pasted objects are never perfectly circular. In particular, we adopt Principal Component Analysis (PCA) [15] to determine the direction along which the second-order central moment of the projected points is maximized. The directions found in this way, represented by the column vectors $u_1 = [u_{1,1}, u_{1,2}]^T$ and $u_2 = [u_{2,1}, u_{2,2}]^T$, are illustrated in Fig. 3(a).

To be more specific, let us denote with P the $2 \times N$ matrix whose i-th column $p_i$ represents the vector of 2-D coordinates of point i within $P_1$, $1 \le i \le N$, where N denotes the number of points in $P_1$, and with $\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i$ the centroid of $P_1$. If we assume $u_1^T u_1 = 1$, the second-order central moment of the projected points in $P_1$ is given by:
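
Writing out the projection of each centered point onto $u_1$ gives the following expression. It is a reconstruction that follows directly from the definitions of $p_i$, $\bar{p}$ and the inertia matrix $S_1$, shown here for the region $P_1$.

```latex
\[
\frac{1}{N} \sum_{i=1}^{N} \left( u_1^T p_i - u_1^T \bar{p} \right)^2
= u_1^T S_1 u_1
\]
```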


where $S_1 = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p})(p_i - \bar{p})^T$ is the inertia matrix of the points in $P_1$. The principal component $u_1$ is found by:
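
Under the unit-norm constraint $u_1^T u_1 = 1$ stated earlier, this maximization can be written as follows (again a reconstruction from the surrounding definitions, with $u$ used as a dummy variable):

```latex
\[
u_1 = \arg\max_{u \,:\, u^T u = 1} \; u^T S_1 u
\]
```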


It can be demonstrated that $u_1$ is the eigenvector corresponding to the largest eigenvalue of $S_1$. We find $u_2$ in a similar way. Then the angle $\alpha$ is computed as:


Once the rotation angle has been estimated, $P_1$ is rotated by the angle $\alpha$ to obtain a new region $P'_1$. To estimate the scaling parameters $f_x$ and $f_y$, we first determine the $h_1 \times w_1$ bounding box of $P'_1$ and the $h_2 \times w_2$ bounding box of $P_2$, as illustrated in Fig. 3(b). The scaling factors are then computed as $f_x = w_2 / w_1$ and $f_y = h_2 / h_1$. Finally, the translation terms are simply the difference between the centroids of $P'_1$ and $P_2$.
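
The three estimation steps can be sketched in a few lines. This is a minimal illustration on point coordinates, not the authors' implementation: the function names are hypothetical, the principal direction is obtained from the closed-form orientation of a symmetric 2x2 matrix rather than from an explicit eigendecomposition, and the angle is taken as the difference of the two axis orientations, which is only defined up to 180 degrees.

```python
import math

def centroid(pts):
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def principal_angle(pts):
    """Orientation (radians) of the leading eigenvector of the 2x2 inertia matrix."""
    cx, cy = centroid(pts)
    n = len(pts)
    sxx = sum((x - cx) ** 2 for x, _ in pts) / n
    syy = sum((y - cy) ** 2 for _, y in pts) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pts) / n
    # Closed form for the major-axis direction of a symmetric 2x2 matrix.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)

def rotate(pts, alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return [(c * x - s * y, s * x + c * y) for x, y in pts]

def bbox_size(pts):
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    return max(xs) - min(xs), max(ys) - min(ys)

def estimate_similarity(p1, p2):
    """Return (alpha, fx, fy, tx, ty) following the three steps in the text."""
    alpha = principal_angle(p2) - principal_angle(p1)   # i) rotation
    p1_rot = rotate(p1, alpha)
    w1, h1 = bbox_size(p1_rot)
    w2, h2 = bbox_size(p2)
    fx, fy = w2 / w1, h2 / h1                           # ii) scaling
    c1, c2 = centroid(p1_rot), centroid(p2)
    return alpha, fx, fy, c2[0] - c1[0], c2[1] - c1[1]  # iii) translation

# Toy check: a 21x11 grid of points, rotated by 30 degrees, scaled by 1.5, shifted.
src = [(float(x), float(y)) for x in range(-10, 11) for y in range(-5, 6)]
dst = [(1.5 * x + 40.0, 1.5 * y - 7.0) for x, y in rotate(src, math.radians(30))]
alpha, fx, fy, tx, ty = estimate_similarity(src, dst)
```

In this toy case the source region is centered at the origin, so the centroid difference coincides with the applied translation; for regions that are nearly circular the principal axes become unstable and the angle estimate is unreliable.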

By applying the same procedure to estimate the transformation mapping $P_2$ into $P_1$, we would simply obtain $H_\theta^{-1}$. Therefore, for simplicity, the transformation is estimated in one direction only (as detailed above), and $H_{\hat{\theta}}^{-1}$ is used as the transformation bringing $P_2$ onto $P_1$. In the following, we let $\tilde{P}_2 = \Psi_{\hat{\theta},k}(P_1)$, where $\hat{\theta}$ is the estimated vector of the parameters of the transformation that moves $P_1$ into $P_2$ (w.l.o.g.), and then $\tilde{P}_1 = \Psi_{\hat{\theta}',k}(P_2)$, where $H_{\hat{\theta}'} = H_{\hat{\theta}}^{-1}$.

The core of the disambiguation system is represented by the blocks that analyze the interpolation artifacts and the boundary inconsistencies (see Fig. 2). For their implementation, we designed two multiple-branch classifiers based on CNNs: a network with 4 parallel branches, called 4-Twins Net, and a Siamese network [12], named Siamese Net. The 4-Twins network is in charge of analysing the interpolation artifacts, while the Siamese network is used to reveal boundary inconsistencies. The outputs of the two networks are finally merged by a score-level fusion module. A block diagram of the resulting architecture, hereafter referred to as DisTool, is shown in Fig. 4. A preliminary step is carried out before running the two networks to identify the input region, or Focus of Attention (FoA), of the networks. Each FoA module takes as input the forged image $I_f$, the binary localization mask (i.e., the output mask of the CM detection algorithm) with the two separate regions $P_1$ and $P_2$, and the geometric transformations estimated as explained in Section III.


Fig. 4: Block diagram of the proposed disambiguation method based on multiple-branch CNNs (DisTool), implementing the scheme in Fig. 2.

We observe that, while Siamese-like architectures have recently been used for addressing several multimedia forensic tasks, see for instance [16], [17], [18], we explicitly designed the 4-Twins architecture for our specific purpose, in order to facilitate the learning of the interpolation artifacts. The motivation behind the use of this multi-branch architecture will become clearer in the following.

A. 4-Twins Net

The 4-Twins network takes as input the two pairs of regions $(P_1, \tilde{P}_1)$ and $(P_2, \tilde{P}_2)$ (the specific FoA for 4-Twins Net is described in Section IV-A1).

Let $x = [x_1, x_2, x_3, x_4] = [(P_1, \tilde{P}_1), (P_2, \tilde{P}_2)]$ be a vector with the pixels of the regions $P_1$, $\tilde{P}_1$, $P_2$ and $\tilde{P}_2$, and let $y \in \{0, 1\}$ indicate the identity of the source and target regions, namely y = 0 if $x = [(S, \tilde{S}), (T, \tilde{T})]$ (holding under hypothesis $H_0$), and y = 1 if $x = [(T, \tilde{T}), (S, \tilde{S})]$ (holding under hypothesis $H_1$). An illustrative example of the patches at the input of 4-Twins Net is provided in Fig. 5 (upper row).

The decision is in favor of the hypothesis that maximizes the output score (softmax) function $f_{tw}$. Therefore, if


then $P_1$ is identified as the source region ($H_0$ holds), and vice versa if the opposite inequality holds.

The architecture of 4-Twins Net and the details of the training procedure are described in the following. Before that, we give the details of the FoA module.


Fig. 5: Illustrative example of the patches at the input of 4-Twins Net (upper row) and Siamese Net (lower row).

1) Focus of Attention (FoA): The two pairs of regions $(P_1, \tilde{P}_1)$ and $(P_2, \tilde{P}_2)$ cannot be directly fed to 4-Twins Net. The practical problem is that the source and target regions of a copy-move can be large and, moreover, their sizes vary from image to image. In order to feed all the branches with patches of the same size, which we set to $64 \times 64 \times 3$, the 4-dim input vector x of 4-Twins Net is built as follows (the first steps are common to Siamese Net). Given the two regions $P_1$ and $P_2$, we fit a rectangular bounding box to each region. Let us denote the bounding box of $P_1$ as $P^b_1$. The bounding box will then contain the entire region $P_1$ (foreground) and some neighboring pixels belonging to $\bar{P}_1$ (background). In the same way, we build the rectangular patch $P^b_2$. Then, we compute $\tilde{P}^b_2 = \Psi_{\hat{\theta},k}(P^b_1)$ and $\tilde{P}^b_1 = \Psi_{\hat{\theta}',k}(P^b_2)$, using bilinear interpolation. In this way, we get the quadruple $[(P^b_1, \tilde{P}^b_1), (P^b_2, \tilde{P}^b_2)]$. To get the 4 inputs of 4-Twins Net, we crop the $64 \times 64$ central part of each region in the quadruple. Notice that, in this way, we are implicitly assuming that the bounding boxes of the source and target regions of the copy-move are always larger than $64 \times 64$ (hence, $64 \times 64$ is considered as the minimum region size).
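
The bounding-box and central-crop steps can be sketched as follows. This is a simplified illustration, assuming a single-channel image stored as a list of rows and a binary mask containing one region only; the function names are hypothetical and the remapping through the estimated transformation is omitted.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero pixels of a binary mask."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    cols = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]

def crop(img, top, left, height, width):
    return [row[left:left + width] for row in img[top:top + height]]

def center_crop(patch, size=64):
    """Crop the size x size central part of a rectangular patch."""
    h, w = len(patch), len(patch[0])
    if h < size or w < size:
        raise ValueError("bounding box smaller than the crop size")
    return crop(patch, (h - size) // 2, (w - size) // 2, size, size)

def foa_patch(img, mask, size=64):
    """Fit a bounding box to the masked region, then keep its central size x size part."""
    top, left, bottom, right = bounding_box(mask)
    box = crop(img, top, left, bottom - top + 1, right - left + 1)
    return center_crop(box, size)

# Toy example: a 100x120 single-channel image with a 70x80 rectangular region.
H, W = 100, 120
image = [[i * W + j for j in range(W)] for i in range(H)]
mask = [[1 if 10 <= i < 80 and 20 <= j < 100 else 0 for j in range(W)] for i in range(H)]
patch = foa_patch(image, mask)
```

The explicit error for undersized boxes reflects the minimum-region-size assumption stated above; how smaller regions should be handled is left open here.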

To avoid complicating the notation, in the following we will not distinguish between regions and patches and continue to use $P_1$, $P_2$, $\tilde{P}_1$ and $\tilde{P}_2$ to denote the inputs of 4-Twins Net. An example of the patches forming the input vector x of 4-Twins Net is given in Fig. 7 for the example in Fig. 6.

2) Network Architecture: The architecture of the 4-Twins Net is given in Fig. 8. It consists of four identical stacks of convolutional layers (i.e. all of them share the same weights), and two identical stacks of fully connected layers. The role of the stacked convolutional layers in each branch is to extract a 512-dim feature vector from each input patch, of size $64 \times 64 \times 3$. We denote the stacked convolutional layers as $F(\cdot)$. The 512-dim feature vectors from the first and second pairs of branches are concatenated by means of a combination function $C(\cdot, \cdot)$ into a 1024-dim vector, and then given as input to the fully connected layers. The fully connected layers return a score (called logit), which is later normalized into a probability value by means of a softmax non-linear activation function. In summary, the 4-Twins architecture consists of two Siamese networks in parallel, sharing the weights of the convolutional layers and the fully connected layers.

Fig. 6: Forged image from CASIA [21] (left) and its ground truth tampering map (right).

Fig. 7: Input vector for 4-Twins Net for the image in Fig. 6 (center crop).

For each Siamese network, we used exactly the same pipeline that has been successfully used as a matching model in computer vision [19], [20]. Each Siamese network has a single output neuron. Let $z_0$, $z_1$ denote the score outputs (logits) of the two Siamese network branches with inputs $(x_1, x_2)$ and $(x_3, x_4)$, respectively. The dependency between $z_0$ and $z_1$ is enforced by the following softmax operation:


Given M training examples $\{(x^{(j)}, y^{(j)})\}_{j \in [1,M]}$, the 4-Twins Net is trained to minimize the empirical cross-entropy loss function between input labels and predictions, that is:


where $[y^{(j)}_0, y^{(j)}_1]$ is the one-hot encoding of $y^{(j)}$ (the one-hot encoding of label 0 is the binary vector [0, 1], that of label 1 is [1, 0]). From (5), we observe that, in order to get a small loss, $z_0$ and $z_1$ must be respectively large and small when y = 0 (i.e., under $H_0$), and small and large when y = 1 (i.e., under $H_1$).
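
The interplay between the two logits and the loss can be illustrated with a minimal sketch. It assumes that the label y directly indexes the softmax output to be maximized, so that a large $z_0$ is rewarded when y = 0, as stated above; the function names and the example logits are hypothetical.

```python
import math

def softmax2(z0, z1):
    """Softmax over two logits, computed in a numerically stable way."""
    m = max(z0, z1)
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def cross_entropy(logit_pairs, labels):
    """Mean cross-entropy; label y selects the y-th softmax output as the target."""
    total = 0.0
    for (z0, z1), y in zip(logit_pairs, labels):
        total -= math.log(softmax2(z0, z1)[y])
    return total / len(labels)

# Logits that agree with the labels give a small loss, swapped logits a large one.
good = cross_entropy([(4.0, -4.0), (-3.0, 3.0)], [0, 1])
bad = cross_entropy([(-4.0, 4.0), (3.0, -3.0)], [0, 1])
```

Because the two probabilities sum to one, raising one logit necessarily lowers the probability assigned to the other hypothesis, which is how the softmax couples the two Siamese outputs.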

In the following, we report the details of the feature extraction, combination and fully connected part of each Siamese branch.

Feature extractor F. We considered the 50-layer Residual Network (ResNet) in [22]. Such a deep architecture is well suited to learn complex pixel relationships. We refer to [22] for a detailed description of this network. The only change we made compared to [22] is the output size, which is set to 512 instead of 1000. Then, in our architecture, we considered 4 identical branches of 50-layer ResNet (F), with shared weights, for the convolutional part.


Fig. 8: The proposed 4-Twins Net architecture. The one-hot encoding of the predicted label is reported in output.

Feature combiner C. Before feeding the fully connected layers, we need to fuse the feature vectors produced by the two Siamese branches F. Some popular choices for doing so are: the point-wise absolute difference [20], the squared Euclidean distance [16], and the concatenation [18]. We chose to implement the combination by means of a concatenation, as done in [18].

Fully connected part. We considered 2 fully connected layers, with input and output sizes equal to 1024 and 256 for the first layer, and to 256 and 1 for the second. The final soft outputs of the two fully connected branches are combined by means of a softmax layer, as detailed above.

3) Training strategy: In this section, we describe the strategies that we followed to feed the data to 4-Twins Net during training. During our experiments we found that such strategies are critical to the success of the 4-Twins Net.

The network is trained with both positive ($H_0$) and negative ($H_1$) examples, in equal percentage; then, the trained model minimizes the overall error probability over the training set. To force the network to learn the interpolation artifacts, the source and target regions of the forged images used for training are purposely built so that they are always much larger than $64 \times 64$ (see Section V-A for the details of the dataset creation process). In this way, the $64 \times 64$ input patches obtained by cropping the central part of the regions contain only foreground pixels. Training 4-Twins Net is performed knowing the ground truth localization mask and the exact geometric transformations between S and T, that is, the forward and backward transformation $H_\theta$. Then, the approximated regions $\tilde{P}^b_1$ and $\tilde{P}^b_2$ are derived by considering the true transformation matrix $H_\theta$. A small random perturbation is applied in order to mimic a practical scenario in which the transformation estimation is not perfect. Specifically, the true angle is perturbed by a random quantity in $[-5^\circ, 5^\circ]$ (with step $1^\circ$), and the true resizing factor is randomly distorted by a value in $[-0.1, 0.1]$ (with quantization step 0.01).
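
The perturbation of the ground-truth parameters can be sketched as below. The text does not state how the distortion is combined with the true values, so an additive perturbation with uniformly drawn steps is assumed here; the function name and the example values are hypothetical.

```python
import random

def perturb_transform(alpha_deg, scale, rng):
    """Perturb the true angle and resizing factor (additive perturbation assumed)."""
    d_alpha = rng.randint(-5, 5)            # degrees, step 1, in [-5, 5]
    d_scale = rng.randint(-10, 10) / 100.0  # step 0.01, in [-0.1, 0.1]
    return alpha_deg + d_alpha, scale + d_scale

rng = random.Random(0)  # fixed seed only to make the illustration repeatable
samples = [perturb_transform(30.0, 1.2, rng) for _ in range(1000)]
```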

Due to feature concatenation, the network is sensitive to the order of the inputs in each pair, that is, in $(x_1, x_2)$ and $(x_3, x_4)$. Let us assume that the first input corresponds to the original (source or target) patch and the second input to the transformed patch. In principle, switching between $x_1$ and $x_2$, as well as between $x_3$ and $x_4$, should leave the predictions unchanged. To enforce this property, we randomly shuffle $(x_1, x_2)$ and $(x_3, x_4)$ during training so that 4-Twins Net does not learn the order of the inputs. Under y = 0, this corresponds to considering not only the pair $x = [(S, \tilde{S}), (T, \tilde{T})]$, but also the pairs $x = [(\tilde{S}, S), (T, \tilde{T})]$, $x = [(\tilde{S}, S), (\tilde{T}, T)]$, and $x = [(S, \tilde{S}), (\tilde{T}, T)]$. A similar strategy is applied under y = 1. Moreover, since the 4 branches of the convolutional layers are forced to be identical, each of them is fed with samples from all the categories during training, that is $\{S, \tilde{S}, T, \tilde{T}\}$, so as to avoid any bias.
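
A minimal sketch of this order randomization is given below. The swap probability of 0.5 and the function name are assumptions; the patches are represented by string placeholders.

```python
import random

def shuffle_pairs(x, rng):
    """Randomly swap the order inside (x1, x2) and inside (x3, x4); the label is unchanged."""
    x1, x2, x3, x4 = x
    if rng.random() < 0.5:
        x1, x2 = x2, x1
    if rng.random() < 0.5:
        x3, x4 = x4, x3
    return [x1, x2, x3, x4]

rng = random.Random(0)  # fixed seed only to make the illustration repeatable
# Under y = 0 the four orderings listed in the text should all appear.
seen = {tuple(shuffle_pairs(["S", "S~", "T", "T~"], rng)) for _ in range(200)}
```

Only the order within each pair changes; the first pair always holds the source patches and the second the target patches, so the label y is preserved.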

Batch Normalization (BN) [23] is performed after each layer in the feature extraction part F, by normalizing the layer outputs so that they have zero mean and unit variance. Normalization is done by accumulating means and standard deviations on mini-batches. Given that in 4-Twins Net the data flows through four branches, this procedure needs care: in particular, in order to avoid biasing the accumulated means and standard deviations, we ensure that, within each mini-batch, each of the four branches is fed with all four categories $\{S, \tilde{S}, T, \tilde{T}\}$. Then, the statistics are accumulated on one branch only and broadcast to the other branches in order to make the four Fs identical.

B. Siamese Net

As we said, the goal of the Siamese Net is to detect boundary inconsistencies. The choice of this structure was based on the following observation. When $P_1$ is the source ($H_0$), we expect that the pixels across the boundary between $P_1$ and the complementary region $\bar{P}_1$ do not present significant inconsistencies, while, when $P_1$ is the target ($H_1$), the presence of inconsistencies along the boundary between $P_1$ and $\bar{P}_1$ is more likely. Let $B_1$ (resp. $B_2$) denote an image region that includes $P_1$ (resp. $P_2$) and some outer pixels of $P_1$ (resp. $P_2$), i.e., part of the complementary region $\bar{P}_1$ (resp. $\bar{P}_2$). Similarly, $B_S$ (resp. $B_T$) denotes an image region that includes S (resp. T) and some outer pixels of S (resp. T). Regions $B_1$ and $B_2$ have the same (or very similar) content inside the inner region and a different content in the outer part. We would therefore like the network to learn to focus on the relationship between the inner and outer regions, i.e., on the values of the pixels across the boundary of the copied part.

The transformation that maps one region into the other, e.g., P1 into P2, also maps (at least approximately, because of the interpolation) the boundary of P1 into that of P2. Therefore, the Siamese network is fed with input pairs (B1, B̃1) (or, similarly, (B2, B̃2)), where B̃1 is obtained by remapping B2 according to the geometric transformation that maps P2 into P1, that is, B̃1 = Ψ_{θ̂,k}(B2). The details of how the regions B1 and B2 are built, which pertain to the FoA block preceding Siamese Net, are described in Section IV-B1. Let x′ = [x′1, x′2] = [B1, B̃1] and y ∈ {0, 1}. The relative position of the patches in the pair determines the value of y: if x′ = [BS, B̃S], that is, P1 ≡ S (hypothesis H0), then y = 0; if instead x′ = [BT, B̃T], that is, P2 ≡ S (hypothesis H1), then y = 1.

An illustrative example of input pairs feeding the Siamese Net is shown in Fig. 5 (lower row).

The decision is in favor of the hypothesis that maximizes the output soft function  fsi(·). Therefore, the condition


indicates that P1 is the source (H0 holds), while the opposite inequality indicates that P2 is the source (H1 holds). We notice that the use of an architecture with 4 branches, like the 4-Twins Net, is not necessary in this case. In fact, regardless of the direction of the transformation, one of the two regions B1 and B̃1 will exhibit inconsistencies between the pixels inside and those outside the boundary, while the other will not. Similarly, one of B2 and B̃2 will contain inconsistencies across the boundary, while the other will not.


of the training procedure are described in the following.

1) Focus of Attention (FoA): To get the input pair x′ for Siamese Net, we start with one of the pairs of bounding box regions (P^b_1, P̃^b_1) and (P^b_2, P̃^b_2), obtained as described in Section IV-A1. To increase the chance of capturing a good extent of the boundary regions, each bounding box region is cropped at its 4 corners (top left, top right, bottom left, and bottom right) to obtain the 64×64 input patches. All 4 resulting input pairs are tested, and the most confident prediction score is selected for the final decision.
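The corner cropping and the selection of the most confident score can be sketched as follows. This is only an illustration under stated assumptions: regions are represented as plain lists of rows, the helper names `corner_crops` and `most_confident` are hypothetical, and "most confident" is assumed to mean the score farthest from 0.5, which is not stated explicitly in the text.

```python
def corner_crops(region, size=64):
    """Return the four size x size corner crops of a 2-D region (list of rows).

    Order: top left, top right, bottom left, bottom right.
    """
    h, w = len(region), len(region[0])
    if h < size or w < size:
        raise ValueError("region smaller than crop size")
    tops = (0, h - size)
    lefts = (0, w - size)
    return [[row[l:l + size] for row in region[t:t + size]]
            for t in tops for l in lefts]

def most_confident(scores):
    """Pick the score farthest from 0.5 (assumed notion of confidence)."""
    return max(scores, key=lambda s: abs(s - 0.5))

# Toy 80 x 100 region whose entries record their own (row, column) position.
region = [[(r, c) for c in range(100)] for r in range(80)]
crops = corner_crops(region)
best = most_confident([0.55, 0.10, 0.70, 0.48])
```

Both regions of a pair would be cropped at the same corner, so that the 4 crops of one region are matched with the 4 crops of the other.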


B2 and B̃1, B̃2, to denote the inputs of Siamese Net.

example in Fig. 6, are provided in Fig. 9.

2) Network architecture: The architecture of Siamese Net corresponds to the one forming the branches of 4-Twins Net. Let z be the output (logit) of the Siamese neural network; the soft (probabilistic) score fsi is computed through a sigmoid activation:

$$f_{si}(x') = \frac{1}{1 + e^{-z}}.$$


Given M training examples {(x′^(j), y^(j))}, j ∈ [1, M], the Siamese Net is trained to minimize the empirical cross-entropy loss Lsi between the predictions f_si^(j) and the input labels y^(j).
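As a small numerical illustration of the sigmoid score and of the empirical cross-entropy, the sketch below assumes the standard binary cross-entropy form, averaged over the M examples; the exact normalization used for Lsi is not given in the text, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function mapping a logit to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cross_entropy(logits, labels, eps=1e-12):
    """Mean binary cross-entropy between sigmoid(logit) and 0/1 labels."""
    total = 0.0
    for z, y in zip(logits, labels):
        f = min(max(sigmoid(z), eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(f) + (1 - y) * math.log(1.0 - f)
    return total / len(logits)

# Confident and correct predictions give a small loss; swapped labels a large one.
good = cross_entropy([5.0, -5.0], [1, 0])
bad = cross_entropy([5.0, -5.0], [0, 1])
```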

3) Training strategy: To force the network to look at boundary inconsistencies, we trained Siamese Net by considering only copy-moves obtained by rigid translations, so that: i) no interpolation artifacts are present (the copy-moved part is identical to the source region), and ii) the boundaries of the two regions match perfectly. The input pairs used during training then correspond to x′ = [BS, BT] and x′ = [BT, BS].

To avoid undesired biases, the network is fed with inputs of the form x′ = [BS, BT] and x′ = [BT, BS] in similar proportions. Note that in this case, switching x′1 and x′2 is accompanied by label switching: in fact, according to the way we trained the network, the output of Siamese Net depends on the relative position of the patch containing the source boundary and the one containing the target boundary. Therefore, if x′ = [BS, BT], then y = 0, whereas if x′ = [BT, BS], then y = 1.

Fig. 9: Examples of the input pairs for Siamese Net, for the image in Fig. 6. y = 0 for the input pairs on the left (since B1 = S), while y = 1 for those on the right (since B2 = T).

C. Fusion module

The output scores ftw(·) and fsi(·) provided by 4-Twins Net and Siamese Net are fused by means of a simple fusion module, as illustrated in Fig. 4. Score-level fusion is performed by assigning a reliability to the outputs of 4-Twins Net and Siamese Net, based on the knowledge we have about the performance of the two networks under various settings. More specifically, the two scores ftw(·) and fsi(·) are weighted based on the (real or estimated) transformation mapping P1 into P2. Let us denote with wtr(Hθ) ∈ [0, 1] the weight assigned to the 4-Twins Net score when the estimated transformation is Hθ, and with wsi(Hθ) the weight assigned to the output of Siamese Net, where wtr(Hθ) + wsi(Hθ) = 1.

We anticipate that, based on our tests, 4-Twins Net achieves very good performance when the transformation can be estimated with sufficient accuracy and relatively strong interpolation artifacts are present in the image, while it is less reliable in the other cases, that is, basically, when the copy-move is close to a rigid translation. When the transformation is a rigid translation, in fact, the two input pairs of 4-Twins Net are identical. On the other hand, Siamese Net works very well (with almost perfect performance) when a close-to-rigid translation is applied, while it is generally less reliable in the other cases. For this reason, in our experiments we considered the following weights:


where c is a constant larger than 0.5 (we recall that α denotes the rotation angle and fx and ft the scaling factors). As to wsi, we have wsi = 1 − wtr.

An alternative solution could be to choose one of the two networks before actually applying them, e.g., based on the estimated transformation (network selection scenario). However, fusing the outputs of both networks provides an advantage when a choice between the two architectures cannot be properly made. As we will see in the experimental section, this is the case, for instance, when heavy local post-processing is applied to the boundary of the target region (in which case the Siamese Net loses accuracy, while 4-Twins Net is more robust and still works well), or in the presence of global post-processing, e.g., JPEG compression (in which case the performance of 4-Twins Net is heavily impaired). In these cases, fusing the outputs of both networks yields better performance.
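To make the score-level fusion concrete, the following sketch combines the two scores as a convex combination. The weighting rule shown is an assumption inferred from the surrounding discussion: the 4-Twins Net score receives weight c when the transformation is not a rigid translation and 1 − c when it is (rotation angle zero and both scaling factors equal to one). The function names, the tolerance `tol`, and the parameter names `fx`, `fy` for the two scaling factors are hypothetical; the default c = 0.65 is the value reported for the experiments. Both scores are assumed to refer to the same hypothesis.

```python
def fusion_weights(alpha_deg, fx, fy, c=0.65, tol=1e-6):
    """Return (w_tr, w_si) with w_tr + w_si = 1.

    ASSUMED rule: near-rigid translations favour Siamese Net (w_tr = 1 - c),
    every other transformation favours 4-Twins Net (w_tr = c), with c > 0.5.
    """
    near_rigid = (abs(alpha_deg) <= tol
                  and abs(fx - 1.0) <= tol
                  and abs(fy - 1.0) <= tol)
    w_tr = 1.0 - c if near_rigid else c
    return w_tr, 1.0 - w_tr

def fuse(f_tw, f_si, alpha_deg, fx, fy, c=0.65):
    """Weighted sum of the 4-Twins Net and Siamese Net soft scores."""
    w_tr, w_si = fusion_weights(alpha_deg, fx, fy, c)
    return w_tr * f_tw + w_si * f_si

# Same scores, different estimated transformations.
rigid_score = fuse(0.5, 0.9, alpha_deg=0.0, fx=1.0, fy=1.0)
rotated_score = fuse(0.5, 0.9, alpha_deg=30.0, fx=1.0, fy=1.0)
```

With an uninformative 4-Twins Net score (0.5) and a confident Siamese Net score (0.9), the fused score stays closer to the Siamese output in the rigid case, which is the behaviour motivated in the text.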

In this section, we first describe the procedure that we followed to generate the synthetic datasets used for training and validating 4-Twins Net and Siamese Net. Then, we present the datasets (both synthetic and real) used for testing. Finally, we describe the scenarios considered in our tests.

A. Synthetic Dataset Creation

To train the multi-branch CNN, and in particular the 4-Twins Net, a large amount of labeled data is needed, i.e., many (x, y) samples. Therefore, a large number of copy-move forgeries with ground-truth masks and labeled source and target regions is required. In [11], a dataset with 10^5 copy-move forged images has been built and made publicly available. This dataset, however, is too small for our goal. To avoid the risk of overfitting, we synthesized a large-scale dataset of copy-move forged images ourselves by considering several geometric transformations and post-processing operations, starting from pristine images of different datasets. Specifically, a dataset with 9 × 10^5 forged images, hereafter denoted as SYN-Tr, was generated for training (and validation). We also built a smaller set of 3 × 10^3 forged images for testing (namely SYN-Ts), as detailed in Section V-B. The creation of SYN-Tr (and SYN-Ts) involves three steps.

Background preparation. We first selected a pool of pristine images from several datasets. Specifically, approximately 28,000 images (both in raw and JPEG formats) were taken from the RAISE_2k [24], DRESDEN [25] and VISION [26] datasets, in similar proportions, to build SYN-Tr. For SYN-Ts, we took 500 images from a personal Canon 600D camera (250 raw and 250 JPEG images, compressed using default camera settings). For each image, we generated multiple forged instances (as detailed below) by randomly cropping portions of size 1024 × 1024. Images whose minimum dimension is smaller than 1024 were skipped.

Source selection. Each 1024 × 1024 image is split into four subregions or quadrants. The source region is obtained by considering one of these quadrants and generating a convex polygon (from a subset of 20 random vertices, selected in such a way that they form a convex hull) within a bounding box of size 170 × 170, randomly located within the selected quadrant. The pixels inside the convex polygon belong to the source region and then constitute the region S.

Target creation. The target region is obtained from the source by means of a similarity transformation. In particular, we considered rotation, resizing, and a composition of them (i.e., rotation followed by resizing, and resizing followed by rotation). Rotation angles were randomly picked in the range [2◦, 180◦], with a sampling step of 2◦, while horizontal and vertical resizing factors were randomly picked in [0.5, 2.0], with a sampling step of 0.01. The geometrically reshaped region is copy-pasted in the center of one of the three remaining quadrants, thus obtaining the target region T. As interpolation method, we used bilinear interpolation. To improve the quality of the forged images and make them more realistic, we blurred the boundary of the target region by applying the following steps: i) detection of the edge of the region from the target mask by using a 5 × 5 high-pass filter, followed by several iterations of binary dilation to emphasize the edges (the number of iterations is empirically set to 5), thus obtaining an edge-enhanced mask; ii) application of an average filter to the image, with a size randomly selected in {3×3, 5×5, 7×7, 9×9, 11×11}, in the positions identified by the edge-enhanced mask. Finally, to mimic a real scenario, we applied global post-processing with probability 0.5. The post-processing types and the corresponding parameters are detailed in Table I, along with their selection probabilities.
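The sampling of the transformation parameters and the order of composition can be illustrated with the sketch below. The function names and the string codes for the transformation kind are hypothetical, points are assumed to be column vectors multiplied on the left by the 2 × 2 matrix, and the translation that places the region in the destination quadrant is omitted.

```python
import math
import random

def sample_params(rng, kind):
    """Sample (alpha_deg, fx, fy) for kind in {"rot", "res", "rot+res", "res+rot"}."""
    alpha = 2 * rng.randint(1, 90) if "rot" in kind else 0  # 2, 4, ..., 180 degrees
    if "res" in kind:
        fx = rng.randint(50, 200) / 100.0  # 0.50, 0.51, ..., 2.00
        fy = rng.randint(50, 200) / 100.0
    else:
        fx = fy = 1.0
    return alpha, fx, fy

def matmul2(a, b):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transform_matrix(alpha_deg, fx, fy, kind):
    """Linear part of the transformation, acting on column vectors."""
    a = math.radians(alpha_deg)
    rot = [[math.cos(a), -math.sin(a)], [math.sin(a), math.cos(a)]]
    res = [[fx, 0.0], [0.0, fy]]
    # "rot+res": rotation is applied first, so it is the right-most factor.
    return matmul2(res, rot) if kind == "rot+res" else matmul2(rot, res)

rng = random.Random(0)
samples = [sample_params(rng, "rot+res") for _ in range(100)]
m_rot_res = transform_matrix(90, 2.0, 1.0, "rot+res")
m_res_rot = transform_matrix(90, 2.0, 1.0, "res+rot")
```

The two composition orders give different matrices whenever the horizontal and vertical factors differ, which is why both are listed as separate cases.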

Another dataset, named SYN-Tr-Rigid, was generated to train the Siamese Net, starting from the same pool of images but considering only rigid translations. Source selection was carried out within smaller bounding boxes of size 74 × 74, so that the boundary can be easily captured during the patch extraction process. Global post-processing is finally applied (with probability 0.5), as before. With regard to the test set SYN-Ts, for each kind of transformation (H), 1000 forged images were generated, 500 with post-processing (PP), as described in Table I, and 500 without post-processing. In the following, we denote with SYN-Ts-H and SYN-Ts-H-PP the datasets of test forged images generated using transformation H, respectively without and with post-processing (PP). H can be a rigid translation (Rigid), rotation and translation (Rot), or resizing and translation (Res).

B. Evaluation Datasets

We assessed the performance of our system on the datasets reported below, all providing ground truth mask and source-target labels for the copy-move forgeries.


Fig. 10: Examples of forged images from the 4 datasets used in our experiments. Red: target, green: source, blue: background.

SYN-Ts. As detailed in Section V-A, this dataset contains 3 × 10^3 test forged images.

USCISI [11]. A synthetic dataset consisting of 10^5 images, which were used for training and testing BusterNet (in a 9 to 1 proportion). All the images are taken from the SUN2012 dataset [27] and Microsoft COCO [28], which provide the object segmentation masks. Objects are copy-moved by means of geometric transformations (see [11] for more details). For our tests, we used all the 10^4 test images.

CASIA [21]. CASIA is the largest publicly available benchmarking dataset for image forgery detection. A subset of 1313 copy-move forged images was manually selected, out of all the 5123 tampered ones, by the authors of [11], to build this dataset of copy-moves, which was made available online. Source and target regions were labeled by comparing the tampered and the pristine images.

Grip [29]. This dataset consists of 80 images tampered with rigid copy-moves. Two post-processing operations, i.e., local noise addition and global JPEG compression, were applied to these images, with different parameters, using the software in [30], thus producing several categories of copy-move forgeries. We manually annotated the source and target regions of all the forged images by looking at the information on the top-left coordinates of the source and target regions provided by the software. Even if rather small, this dataset is useful to test the performance in the case of rigid copy-moves. Some examples of copy-move forgeries from the four datasets are depicted in Fig. 10.

C. Parameters setting for networks training and fusion

TABLE I: Post-processing and corresponding probabilities.

The two networks, 4-Twins Net and Siamese Net, were trained independently by using 810,000 images from the SYN-Tr dataset; the remaining 90,000 images were reserved for validation. We trained both networks for approximately 60 epochs (375,000 iterations with batch size 128) using the Adam optimizer. The learning rate was set to 10^−4 and halved every 10 epochs starting from epoch 40 to improve convergence. We used the TensorFlow framework for network training and testing.
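The learning rate schedule can be written as a small function. This is one possible reading of the description, stated as an assumption: epochs are counted from 0, the rate stays at 10^−4 up to epoch 39, and the first halving takes place at epoch 40, followed by another one every 10 epochs. The function name and parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, start=40, every=10):
    """Step schedule: constant before `start`, then halved every `every` epochs."""
    if epoch < start:
        return base_lr
    n_halvings = (epoch - start) // every + 1
    return base_lr * 0.5 ** n_halvings

# Learning rate at a few representative epochs of a 60-epoch run.
schedule = {e: learning_rate(e) for e in (0, 39, 40, 49, 50, 59)}
```

Under this reading, a 60-epoch run ends with a learning rate equal to one quarter of the initial value.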

For score-level fusion, the weights wtr and wsi = 1 − wtr were set as in equation (8). We considered several values of the constant c and selected the one achieving the best fusion accuracy over the synthetic testing dataset SYN-Ts, corresponding to c = 0.65.

D. Testing Scenarios

The testing scenarios considered for our experiments correspond to: i) the ideal case of known binary mask with undistinguished S and T, and known transformation, ii) the case of known binary mask only, and iii) the realistic case where everything is unknown and the mask corresponds to the output of a state-of-the-art CM detection and localization algorithm. The first two scenarios were considered to test the disambiguation capability of the proposed approach, and the impact of possible inaccuracies introduced by the estimation of the geometric transformation. Then, in the third scenario, we assessed the performance of an end-to-end system for copy-move detection and localization that uses DisTool to identify the source and target regions of the copy-move.

1) Known mask and transformation: In order to assess the disambiguation capability of DisTool, we consider the case in which both the binary localization mask (ground-truth localization mask) and the transformation Hθ are given. For these tests, we used the SYN-Ts and USCISI datasets, which provide the ground truth for the transformation matrix. From the forged images, the input patches of the 4-Twins Net and Siamese Net branches are determined as detailed in Sections IV-A1 and IV-B1. The two separate regions of the ground-truth mask, P1 and P2, to be given as input to DisTool, are isolated from the tampering map.

2) Known mask only: In this second testing scenario, we considered the case where only the binary localization mask is known. We then used the method described in Section III to estimate the transformation from the binary masks of the two regions. We tested the performance of DisTool on all the four datasets, namely SYN-Ts, USCISI, CASIA, and Grip. Furthermore, we assessed the robustness of the system to post-processing on SYN-Ts-H-PP and Grip datasets, for which processed versions of the forged images are provided.

3) End-to-end performance: In this scenario, we evaluated the performance of an end-to-end system for simultaneous copy-move localization and source-target disambiguation by means of DisTool. For copy-move localization, we considered the patch-based algorithm in [4], hereafter referred to as DF-CMFD (Dense Field Copy-Move Forgery Detection), which works reasonably well under general conditions (e.g., also when the copy-moved area has a small size, or in the presence of local post-processing). In this case, a pre-processing step has to be applied to determine the two regions, P1 and P2, from the binary output mask provided by the localization algorithm. If more than two regions are identified, then the (1-1) condition is not met and the image is discarded. Specifically, we first process the mask by applying a morphological opening with a square structuring element of size 2×2. After that, we perform Connected Component (CC) analysis to label connected regions, sort them by size, and discard the images for which the ratio between the sizes of the third-ranked and second-ranked regions is not small enough (the threshold is empirically set to 0.2). The number of images retained after this stage is denoted as OptIn. The performance of the system is evaluated on the OptIn set only. For the opted-out images, in fact, the two separate regions defining the CM operation cannot be identified, and the disambiguation system cannot be run. Reasonably, this should be regarded as a failure of the CM localization algorithm (more rarely, as a failure of the non-ideal pre-processing step).
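The opt-in decision based on Connected Component analysis can be sketched in plain Python as follows. The sketch makes several assumptions: the morphological opening has already been applied to the mask, components are 4-connected, an image with fewer than two components is opted out, and "small enough" means strictly below the 0.2 threshold. The function names are hypothetical.

```python
from collections import deque

def component_sizes(mask):
    """Sizes of the 4-connected foreground components of a binary mask, largest first."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    sizes = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, size = deque([(r, c)]), 0
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    size += 1
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                sizes.append(size)
    return sorted(sizes, reverse=True)

def opt_in(mask, ratio_threshold=0.2):
    """True if the two largest regions can be taken as P1 and P2."""
    sizes = component_sizes(mask)
    if len(sizes) < 2:
        return False
    if len(sizes) == 2:
        return True
    return sizes[2] / sizes[1] < ratio_threshold

# Two 3 x 3 regions plus a single spurious pixel: the image is retained.
mask = [[0] * 12 for _ in range(12)]
for r in range(3):
    for c in range(3):
        mask[r][c] = 1
        mask[r + 6][c + 6] = 1
mask[0][10] = 1
kept = opt_in(mask)
```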

For this testing scenario, the results are compared with those achieved by BusterNet [11], which simultaneously aims at copy-move localization and source-target disambiguation. For a fair comparison, the disambiguation performance achieved by BusterNet is assessed by considering the subset of images for which two separate regions can be identified by the CM localization branch of the algorithm (OptInB). Unlike our approach, BusterNet may return the same label for the two regions, that is, the regions are simultaneously labeled as source and target. In this case, the algorithm fails and the accuracy of the disambiguation is equivalent to a random choice (error probability equal to 0.5). Notice that a comparison with BusterNet is not possible for the first two testing scenarios, since the method in [11] is an end-to-end one that provides the results of localization and disambiguation at the same time, without the possibility of taking a localization mask as input for the disambiguation part only.

As we said, we ran our tests for the case of copy-moves with a single source and a single target. In the datasets considered for our experiments, the numbers of images satisfying this condition are: 9984 out of 10000 for USCISI-CMFD, 1276 out of 1313 for CASIA-CMFD, and the entire Grip-CMFD.

Fig. 11: Accuracy of DisTool (4-Twins Net, Siamese Net, and fusion) on Grip under JPEG compression (left) and noise addition (right). 80 images are considered for each case.

As the evaluation metric, we considered the accuracy of the disambiguation task, computed as the ratio of correctly disambiguated copy-moves over the total number of opted-in images. In the first and second testing scenarios, the OptIn set corresponds to the set of all the images with a single source and a single target.

A. Known Mask and Transformation

Table II reports the accuracy of 4-Twins Net, of Siamese Net, and of the overall system after the final fusion step, on SYN-Ts and USCISI.

These results confirm that 4-Twins Net works very well in all cases except when the transformation is a rigid translation (SYN-Ts-Rigid), because the four patches are then very similar. Siamese Net instead works well in the presence of rigid translations, as expected, while it exhibits slightly lower performance in the presence of rotation and resizing. The performance of Siamese Net on USCISI is very poor, probably because most of the transformations in USCISI include very strong rotation and resizing, and the boundaries are blended using a particular editing operation (Poisson editing) [31], which has not been considered in our training sets.

Nevertheless, thanks to the final fusion step, the overall system achieves very good performance in all the cases and the loss of performance with respect to 4-Twins Net and Siamese Net in their best performing scenarios is very limited. In particular, the results achieved by DisTool on USCISI show that the proposed architecture works well also under database mismatch conditions thus proving the good generalization capability of our system.

TABLE II: Accuracy (%) of 4-Twins Net, Siamese Net and DisTool on SYN-Ts and USCISI. The ground-truth is available for both the mask and the transformation matrix.


B. Known Mask only

The accuracies of our system in this scenario are reported in Table III. By looking at the performance on SYN-Ts and USCISI, we can draw conclusions similar to those drawn for the known transformation case (Table II), which indicates that our method for estimating the transformation works well. When the more realistic datasets, CASIA and Grip, are considered, the performance decreases slightly. This is not surprising, given that the copy-move forgeries contained in these datasets are produced manually in different ways and under various processing operations. For instance, the forged images in CASIA are produced with Photoshop, and advanced tools for tonal adjustment have been used. The forgeries contained in Grip consist of visually realistic snippets carefully designed by photographic experts. In view of this, the results achieved in these cases are also satisfactory. The poor performance of 4-Twins Net on Grip is due to the fact that the copy-move forgeries are all rigid translations (as in SYN-Ts-Rigid). Again, the fusion step improves the results of 4-Twins Net in the most critical cases of close-to-rigid translations, without impairing the performance too much in the other cases.

We also performed a robustness analysis. The performance of our system in the presence of the post-processing described in Table I, assessed on the SYN-Ts-H-PP dataset, is reported in Table IV. The robustness of our system on the Grip dataset, under JPEG compression with different quality factors (QFs) and addition of local noise of various strengths, is shown in Fig. 11. We observe that JPEG compression, being a global post-processing operation, has a minor impact on the performance of the system, unless the quality of the image is significantly impaired (QF < 60). The addition of local noise, instead, leaves visible traces in the target, so that source and target can be more easily disambiguated as the noise increases.

TABLE III: Accuracy (%) of 4-Twins Net, Siamese Net and DisTool on all the four datasets. Only the binary mask is given.


TABLE IV: Accuracy (%) of 4-Twins Net, Siamese Net and DisTool on SYN-Ts-H-PP. Only the binary mask is given.


C. End-to-end performance

In this section, we report the performance of DisTool when the network is used within an end-to-end copy-move detection system with detection, localization and source-target disambiguation capabilities. With regard to the CM detection and localization algorithm, we considered the DF-CMFD method in [4] in all cases, with the exception of the USCISI dataset, where the method in [4] works poorly and we used the CNN-based method proposed in [11] (BusterNet-CMFD). The results of our tests are reported in Table V. In this case, the OptIn images are the images for which the two duplicated regions can be correctly identified after the application of the CM localization algorithm and the pre-processing. This number is lower than the total number of images, much lower in some cases, mainly due to the failures of the CM detection algorithm.

The performance is compared to that achieved by BusterNet on the OptInB set (last two columns). Even if the numbers of OptIn and OptInB images are not the same, mainly due to the difference in the localization method adopted in the two cases, OptIn and OptInB have a similar meaning (see Section V-D3); hence, the disambiguation performance achieved by our system on the OptIn set can be fairly compared to that achieved by BusterNet on the OptInB set.

Noticeably, the number of duplicated regions correctly localized by DF-CMFD (OptIn) is always higher than the corresponding number for BusterNet (OptInB), the only exception being CASIA, where OptInB is 688, while OptIn is 482. However, the performance of BusterNet in this case is very poor, and the average accuracy of the disambiguation is about 50% (hence similar to a random guess). We also observed that BusterNet often assigns the same label (source or target) to the two duplicated regions, meaning that the method is not able to disambiguate between them.

By inspecting the table, we see that BusterNet gives better results than our method only on USCISI, which is the same dataset used for its training, hence corresponding to a favorable case for that method. On all the other datasets, DisTool outperforms BusterNet. Notably, DisTool also works well in the most difficult cases, namely the public realistic datasets (CASIA and Grip).

We further emphasize that, in our experimental analysis, we only considered two common and well-known methods for CM detection and localization, namely those in [4] and in [11]. Other methods could be considered as well (for instance [32]). The fact that our system can work on top of any CM localization algorithm is in fact a remarkable strength of the approach. Moreover, since different methods (e.g., patch-match-based or keypoint-based) often have different peculiarities and work better in different conditions, the best CM localization method could be chosen based on the kind of images under analysis.

We have proposed a method for source-target disambiguation in copy-move forgeries. This problem has not received much attention in the past, yet solving it is of primary importance to correctly localize the tampered region in a copy-move forgery. Common existing algorithms, in fact, identify both the original (source) and copied (target) regions, yet only the target region corresponds to a tampered area. To address this problem, we leveraged the capability of deep neural network architectures to learn suitable features for exposing the target region, by looking at the presence of interpolation artifacts and boundary inconsistencies. Specifically, we proposed an architecture with two multi-branch CNNs that extract different features and perform the disambiguation independently; then, decision fusion is applied at the score level. Our tests show that our disambiguation method, called DisTool, performs well even in the realistic testing scenario, where the copy-move binary localization mask is provided by a CM detection algorithm and the CM transformation is estimated from such a mask. Based on our tests, the proposed architecture, trained on a synthetic dataset, also achieves good results on copy-move images from realistic public datasets, which indicates that the method generalizes well.

As future work, we could investigate other strategies to perform the fusion of the network outputs. In particular, methods based on machine learning could be adopted, e.g., an SVM or a random forest classifier. Another interesting possibility would be to resort to fuzzy logic fusion [33]. A fuzzy fusion module could also be integrated in the multi-branch CNN architecture, which could then be trained as a whole. In this way, the weights of the fuzzy logic module could also be optimized through backpropagation [34]. Finally, the analysis of the multi-target copy-move scenario, i.e., the case (n-m), with m > n, could also be considered as future research. In this case, a pre-processing step could be carried out to reduce the problem to several (1-1) problems, which can then be solved using the DisTool architecture presented in this paper.

This work has been partially supported by a research sponsored by DARPA and Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. The U.S. Government is authorised to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and Air Force Research Laboratory (AFRL) or the U.S. Government.

The authors would also like to thank Giulia Boato from the University of Trento for advice and financial support.

[1] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou, “An evaluation of popular copy-move forgery detection approaches,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 6, pp. 1841–1854, Dec 2012.

[2] W. Tan, Y. Wu, P. Wu, and B. Chen, “A survey on digital image copy-move forgery localization using passive techniques,” 2019.

[3] A. J. Fridrich, B. D. Soukal, and A. J. Lukáš, “Detection of copy-move forgery in digital images,” in Proceedings of Digital Forensic Research Workshop. Citeseer, 2003.

[4] D. Cozzolino, G. Poggi, and L. Verdoliva, “Efficient dense-field copy–move forgery detection,” IEEE Trans. on Information Forensics and Security, vol. 10, no. 11, pp. 2284–2297, 2015.

[5] H. Huang, W. Guo, and Y. Zhang, “Detection of copy-move forgery in digital images using sift algorithm,” in 2008 IEEE Pacific-Asia Workshop on Computational Intelligence and Industrial Application, vol. 2, Dec 2008, pp. 272–276.

[6] I. Amerini, L. Ballan, R. Caldelli, A. Del Bimbo, and G. Serra, “A sift-based forensic method for copy–move attack detection and transformation recovery,” IEEE Transactions on Information Forensics and Security, vol. 6, no. 3, pp. 1099–1110, Sep. 2011.

TABLE V: Accuracy (%) of DisTool for the end-to-end system on all the 4 datasets. The CM detectors DF-CMFD [4] and BusterNet-CMFD [11] are considered.


[7] E. Silva, T. Carvalho, A. Ferreira, and A. Rocha, “Going deeper into copy-move forgery detection: Exploring image telltales via multi-scale analysis and voting processes,” Journal of Visual Communication and Image Representation, vol. 29, pp. 16–32, 2015.

[8] Y. Rao and J. Ni, “A deep learning approach to detection of splicing and copy-move forgeries in images,” in 2016 IEEE International Workshop on Information Forensics and Security (WIFS). IEEE, 2016, pp. 1–6.

[9] Y. Liu, Q. Guan, and X. Zhao, “Copy-move forgery detection based on convolutional kernel network,” Multimedia Tools and Applications, vol. 77, no. 14, pp. 18 269–18 293, 2018.

[10] Y. Wu, W. Abd-Almageed, and P. Natarajan, “Image copy-move forgery detection via an end-to-end deep neural network,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 1907–1915.

[11] ——, “BusterNet: Detecting copy-move image forgery with source/target localization,” in Proc. of ECCV 2018, 2018, pp. 170–186.

[12] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a “siamese” time delay neural network,” in Proc. of NIPS, 1993, pp. 737–744.

[13] I. Amerini, L. Ballan, R. Caldelli, A. D. Bimbo, L. D. Tongo, and G. Serra, “Copy-move forgery detection and localization by means of robust clustering with j-linkage,” Signal Processing: Image Communication, vol. 28, no. 6, pp. 659–669, 2013.

[14] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.

[15] H. Hotelling, “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, pp. 417–441, 1933.

[16] D. Cozzolino and L. Verdoliva, “Noiseprint: a CNN-based camera model fingerprint,” CoRR, vol. abs/1808.08396, 2018.

[17] O. Mayer and M. C. Stamm, “Learned forensic source similarity for unknown camera models,” in Proc. of ICASSP, 2018, pp. 2012–2016.

[18] M. Huh, A. Liu, A. Owens, and A. A. Efros, “Fighting fake news: Image splice detection via learned self-consistency,” in Proc. of ECCV, 2018.

[19] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. of CVPR, 2005, pp. 539–546.

[20] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proc. of ICML DL workshop, 2015.

[21] J. Dong, W. Wang, and T. Tan, “CASIA image tampering detection evaluation database,” in Proc. of IEEE CS and Int. Conf. on SIP, July 2013, pp. 422–426.

[22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of CVPR, 2016, pp. 770–778.

[23] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of ICML, 2015, pp. 448–456.

[24] D.-T. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato, “RAISE: A raw images dataset for digital image forensics,” in Proc. of MMSys, 2015, pp. 219–224.

[25] T. Gloe and R. Böhme, “The ‘Dresden Image Database’ for benchmarking digital image forensics,” in Proc. of SAC, vol. 2, 2010, pp. 1585–1591.

[26] D. Shullani, M. Fontani, M. Iuliani, O. Al Shaya, and A. Piva, “VISION: a video and image dataset for source identification,” EURASIP Journal on Information Security, vol. 2017, no. 1, 2017.

[27] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, “SUN database: Large-scale scene recognition from abbey to zoo,” in Proc. of CVPR, 2010, pp. 3485–3492.

[28] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. Zitnick, “Microsoft COCO: Common objects in context,” in Proc. of ECCV, 2014, pp. 740–755.

[29] D. Cozzolino, G. Poggi, and L. Verdoliva, “Copy-move forgery detection based on patchmatch,” in Proc. of ICIP, 2014, pp. 5312–5316.

[30] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou, “An evaluation of popular copy-move forgery detection approaches,” IEEE Trans. on Information Forensics and Security, vol. 7, no. 6, pp. 1841–1854, 2012.

[31] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Trans. on Graphics, vol. 22, no. 3, pp. 313–318, 2003.

[32] Y. Li and J. Zhou, “Fast and effective image copy-move forgery detection via hierarchical feature point matching,” IEEE Trans. on Information Forensics and Security, 2018.

[33] T. Terano, K. Asai, and M. Sugeno, Fuzzy Systems Theory and Its Applications. San Diego, CA, USA: Academic Press Professional, Inc., 1992.

[34] S.-B. Cho and J. H. Kim, “Multiple network fusion using fuzzy logic,” IEEE Transactions on Neural Networks, vol. 6, no. 2, pp. 497–501, 1995.