Image Hashing by Minimizing Independent Relaxed Wasserstein Distance

2020·Arxiv

ABSTRACT

ABSTRACT

Image hashing is one of the fundamental problems that demand both efficient and effective solutions for various practical scenarios. Adversarial autoencoders are shown to be able to implicitly learn a robust, locality-preserving hash function that generates balanced and high-quality hash codes. However, the existing adversarial hashing methods are inefficient to be employed for large-scale image retrieval applications. Specifically, they require an exponential number of samples to be able to generate optimal hash codes and a significantly high computational cost to train. In this paper, we show that the high sample-complexity requirement often results in sub-optimal retrieval performance of the adversarial hashing methods. To address this challenge, we propose a new adversarialautoencoder hashing approach that has a much lower sample requirement and computational cost. Specifically, by exploiting the desired properties of the hash function in the low-dimensional, discrete space, our method efficiently estimates a better variant of Wasserstein distance by averaging a set of easy-to-compute one-dimensional Wasserstein distances. The resulting hashing approach has an order-of-magnitude better sample complexity, thus better generalization property, compared to the other adversarial hashing methods. In addition, the computational cost is significantly reduced using our approach. We conduct experiments on several real-world datasets and show that the proposed method outperforms the competing hashing methods, achieving up to 10% improvement over the current state-of-the-art image hashing methods. The code accompanying this paper is available on Github1.

CCS CONCEPTS

Information systems Image search; Information retrieval; Content analysis and feature selection; • Theory of computation

Adversarial learning; • Computing methodologies ral networks; Unsupervised learning.

KEYWORDS

Hashing, neural networks, adversarial autoencoders, optimal transport, wasserstein distance

1 INTRODUCTION

The rapid growth of visual data, especially images, brings many challenges to the problem of finding similar items. Exact similarity search, which aims to exhaustively find all relevant images, is often impractical due to its computational complexity. This is due to the fact that a complete linear scan of all the images in such massive databases is not feasible, especially when the database contains millions (or billions) of items. Hashing is an approximate similarity search method which provides a principled approach for web-scale databases. In hashing, high-dimensional data points are projected onto a much smaller locality-preserving binary space via a hash function , where m is the dimension of the binary space. Approximate search for similar images can be efficiently performed in this binary space using Hamming distance [25]. Furthermore, the compact binary codes are storage-efficient.

The existing hashing methods can be broadly grouped into supervised and unsupervised hashing. Although supervised hashing offers a superior performance, unsupervised hashing is more suitable for large databases because it learns the hash function without any labeled data. One of the widely used unsupervised hashing techniques is Locality Sensitive Hashing (LSH) [25]. Image hashing can also be sub-categorized as shallow hashing methods, such as Spectral Hashing (SpecHash) [33] and Iterative Quantization (ITQ) [16], and deep hashing methods, such as SSDH and DistillHash [14, 35, 36].

Even though existing methods have shown some reasonable performance improvements in several image-hashing applications, they have two main drawbacks: (1) their objective functions are heuristically constructed without a principled characterization of the goodness of the hash codes, and (2) the gap between the desired, discrete solution and the relaxed, continuous solution is minimized heuristically with explicit constraints. The latter increases the time to tune the additional hyperparameters of the models. The authors of [11, 12] show that employing adversarial autoencoders for hashing avoids these explicitly constructed constraints. Their hashing model implicitly learns the hash functions by adversarially match the hash functions’ output with a target, supposedly optimal discrete prior. This removes the need for time-consuming hyperparameter tuning. Training adversarial autoencoders by minimizing the Jensen-Shannon or Wasserstein distance is, however, difficult, especially when the dimension of the latent space increases. The scaling difficulty of the adversarial autoencoders may be related

, , Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, and Chandan K. Reddy

to one fundamental issue: the generalization property of matching distributions. The authors of [2] show that Jensen-Shannon divergence and Wasserstein distance do not generalize, in a sense that the generated distribution cannot converge to the target distribution without an exponential number of samples. Without good generalization, the retrieval performance can be sub-optimal.

To address these aforementioned challenges, we propose a novel unsupervised Discrete Component-wise Wasserstein Autoencoder (DCW-AE) model for the image hashing problem. The proposed model implicitly learns the optimal hash function using a novel and efficient divergence minimization framework. The main contributions of the paper are as follows:

• Demonstrate that the ability to match the distribution of the output of the learned hash function to the target discrete distribution matching is closely related to the retrieval performance. Specifically, employing a distance with an easier convergence to the target distribution (called generalization) results in better retrieval performance. To this end, the existing Wasserstein-based Adversarial Autoencoders have poor generalization; thus they have a sub-optimal retrieval performance.

• Propose a novel, efficient approach to learn the hash functions by employing a more generalizable variant of the Wasserstein distance, that leverages the discrete properties of hashing. It has an order-of-magnitude better generalization property and an order-of-magnitude more efficient computation than the existing Wasserstein-based hashing methods.

• Demonstrate the superiority of the proposed model over the state-of-the-art hashing techniques on various widely used real-world datasets using both quantitative and qualitative performance analysis.

The rest of the paper is organized as follows. We discuss the related work in Section 2. In Section 3, we describe the details of the proposed method. Finally, we present quantitative and qualitative experimental results in Section 4 and conclude our discussion in Section 5.

2 RELATED WORK

In this section, we begin by discussing the related work in the image hashing domain, with main focus towards the motivation behind adversarial autoencoders. Then, we continue our discussion on adversarial learning and their limitations, especially on their generalization property when matching to the target distribution (or sample complexity requirement).

2.1 Image Hashing

Various supervised and unsupervised methods have been developed for hashing. Examples of supervised hashing methods include [5, 13, 30, 34, 37] and examples of unsupervised hashing methods include [9, 10, 16, 18–21, 26, 29, 33, 36]. While supervised methods demonstrate a superior performance over unsupervised ones, they require human-annotated datasets. Annotating massive-scale datasets, which are common in the image hashing domain, is an expensive and tedious task. Furthermore, besides the train/test distribution-mismatch problem, supervised methods easily get stuck in bad local optima when labeled data are limited. Thus, exploring the unsupervised hashing techniques is of great interest, especially in the image-hashing domain.

Hashing methods can also be categorized as either data independent or dependent. One of the most popular data-independent hashing technique is LSH [25]. Data-dependent hashing includes popular methods such as SpecHash [33] and ITQ [16]. Data-dependent hashing demonstrates a significant increase in retrieval performance because it considers the data distribution. Hashing methods can also be categorized as shallow [15, 19, 33] and deep hashing [9, 10, 21, 26, 36]. The deep hashing methods can learn nonlinear hash functions and have shown superiority over the shallow approaches.

In general, the existing hashing methods learn the hash functions by minimizing the following training objective:

where is the data distribution, L(x, f (x)) is the locality-preserving loss of the hash function is a hashing constraint with as its corresponding weight (thus a hyperparameter to tune).

The authors of [11, 12] show that by matching the latent space of the autoencoder with an optimal discrete prior, we can implicitly learn a good hash function f while simultaneously satisfying the constraints . However, their adversarial methods are unstable in practice and do not show a good generalization property. In Section 4, we will show that poor generalization results in a sub-optimal retrieval performance when the model is trained with stochastic optimization techniques such as Stochastic Gradient Descent (SGD).

2.2 Adversarial Learning

Generative Adversarial Network (GAN) has recently gained popularity due to its ability to generate realistic samples from the data distribution [17]. A prominent feature of GAN is its ability to “implicitly” match output of a deep network to a pre-defined distribution using the adversarial training procedure. Furthermore, adversarial learning has been leveraged to regularize the latent space, as it helps in learning the intrinsic manifold of the data [12, 28]. For example, the adversarially trained autoencoders can learn a smooth manifold of the data in the low-dimensional latent space [28]. However, training the adversarial autoencoders remains challenging and inefficient because of the alternating-optimization procedure (minimax game) between the generator and the discriminator. For instance, the work presented in [12] employs the original minimax game [17], which suffers from mode-collapse and vanishing gradient [1]. Moreover, in the minimax optimization, the generator’s loss fluctuates during training instead of “descending”, making it extremely challenging to know when to stop training the model.

Wasserstein-based adversarial methods overcomes a few of these limitations (specifically, mode-collapse and vanishing gradient) [1, 32]. However, because they employ the Kantorovich-Rubinstein dual, the minimax game still exists between the generator and the critic. On the other hand, the work in [11, 22] directly estimates the Wasserstein distance by solving the Optimal Transport (OT) problem. Solving the OT has two main challenges. Firstly, its computation cost is where N is the number of data

Figure 1: The network architecture of the proposed DCW-AE model. During the training phase, the parameters of trained, while the parameters of h are either fixed or trained. During the hashing phase, the vector b is thresholded to obtain the hash code c.

points and d is the dimension of the data points. This is expensive. Secondly, the OT-estimate of the Wasserstein distance requires an exponential number of samples to generalize (or to achieve a good estimate of the distance) [2]. In practice, both the highcomputational cost and exponential-sample requirement make the OT-based adversarial methods very inefficient.

Sliced Wasserstein Distance (SWD), on the other hand, approximates the Wasserstein distance by averaging the one-dimensional Wasserstein distances of the data points when they are projected onto many random, one-dimensional directions [8]. SWD is more generalizable than the OT, with polynomial sample complexity [7]. Furthermore, the SWD estimate has a computational cost of , hence its computational complexity is better than that of the OT . Nevertheless, in the high dimensional space, it becomes very likely that many random, one-dimensional directions do not lie on the manifold of the data. In other words, along several of these directions, the projected distances are close to zero. Consequently, in practice, the number of random directions is often larger than . For example, in [8], for a mini-batch size of 64, SWD needs 10, 000 projections, which is significantly larger than 641, to generate good visual images. To address this problem, Max-SWD finds the best direction and estimates the Wasserstein distance along this direction [7].

In this paper, we address the limitations of these GAN-based approaches by robustly and efficiently minimizing a novel variant of the Wasserstein distance. By carefully studying the properties of the target distribution in hashing, the proposed adversarial hashing method significantly more efficient than both the existing OT-based and SWD-based approaches.

3 PROPOSED METHOD

3.1 Problem statement

Given a data set images where , the goal of unsupervised hashing is to learn a hash function that can generate binary hash code of the image x. m denotes the length of the hash code b and it is typically much smaller than d.

3.2 Network architecture

We propose the DCW-AE network. Figure 1 shows the architecture of DCW-AE. Similar to the existing image hashing approaches [35, 36], we choose to employ a feature extractor, such as the VGG network [31], and represent an image by its extracted feature vector x. In particular, the feature extractor is defined as the function h : is the pixel-representation of the image. Note that, h can be trained in an end-to-end framework similar to that of [12]. However, in our paper, we choose to use the pretrained VGG-feature extractor h and do not retrain its parameters.

The encoder, represented by the function , computes the low-dimensional representation b. Given the feature vector x of an image, the output b = f (x) is represented by the m independent probabilities , where is the parameter of the encoder. To generate the hash codes, we simply compute . Note that the encoder f is also the hash function

, , Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, and Chandan K. Reddy

Table 1: Notations used in this paper.

the pixel vector and the extracted feature vector of the image, respectively

feature extractor, encoder, and decoder, respectively.

parameters of the feature extractor, encoder, and decoder, respectively.

Wasserstein distance and its empirical estimate, respectively

vector that defines the projection onto a random one-dimensional direction.

adversarial matching loss (also called the distributional distance).

for our purpose. We then regularize the posterior distribution of b, called with a predefined, discrete prior by minimizing their distributional distance . The decoder, represented by the function , reconstructs the input, denoted as ˆx. We train our model by minimizing the reconstruction loss

In the following sections, we will discuss the details of the proposed method, especially the novel, alternative formulation of the Wasserstein distance calculation which significantly improves the retrieval results of the existing adversarial autoencoders.

3.3 Locality preservation of the hash codes

The autoencoder is trained to minimize the mean-squared error between the input and the reconstructed output, as below:

It is easy to show that minimizing the reconstruction loss is equivalent to preserving locality information of the data in the original input space [6]. In other words, the proposed autoencoder model learns the hash function f that preserves the original input locality. Our approach to preserve the locality of the input in the discrete space using an autoencoder is different from the approaches taken by SSDH [35] and DistillHash [36]. These methods heuristically constructs the semantic, pairwise similarity matrix from the representation x. Consequently, the retrieval performance closely depends on the quality of the representation x and the constructed similarity matrix. As we shall see in Section 4, when a good feature extractor h is not available, our approach significantly outperforms these methods.

3.4 Implicit optimal hash function learning

We regularize the the encoder’s output b to match a pre-defined binary prior. Specifically, we sample a vector z as the real data. Each component of z is independently and identically sampled from a one-dimensional Bernoulli distribution with a parameter p. The sampling procedure defines a distribution over z while the encoder defines a distribution over the latent space b. The encoder, which is the generator in the GAN game, learns its parameters by minimizing the Wasserstein distance as follows:

where is the set of all possible joint distributions of b and z whose marginals are and , respectively, and d(b,z) is the cost of transporting one unit of mass from b to z.

Given a finite, finite, , one approach is to minimize the empirical Optimal Transport (OT) cost as follows:

where M is the assignment matrix, D is the cost matrix where and is the Hadamard product. This Linear Programming (LP) program has the following constraints:

The best method of solving the OT’s LP program has a cost of approximately [4], where N is the number of examples. While it is entirely possible to implement this LP program in a Stochastic Gradient Descent (SGD) training for small mini-batch sizes 2, it is computationally expensive for larger N. As discussed in Section 2, the OT has an exponential sample complexity. This requirement makes the OT less suitable in practice, where large mini-batches are necessary for the models to perform well.

3.5 Discrete Component-wise Wasserstein Distance

In our experiments, we observe that it is difficult to match to by solving the OT. We conjecture that the reason is because the OT is a poor estimate of the Wasserstein distance. It has high variance when the number of samples N in the mini-batches is small [22]. On the other hand, SWD is a more “generalizable” distance estimate than the OT [7]. Generalization refers to the number of samples the algorithm needs to converge to the target distribution.

Figure 2: The hash output b (fake, scattered points) versus the prior, target hash codes z (real, part of the corner clusters) in the 2-D discrete space. In Figures 2a and 2c , we want to push the data points closer to the corners. Clearly, projections onto the axes (black vertical and horizontal lines) result in the most correction. In Figures 2b and 2d, we can clearly see this effect.

The empirical SWD of two samples D and F is estimated as follows:

where is a vector which defines the projection onto a random, one-dimensional direction and is the number of such random projections. While SWD has a better sample complexity than the OT, it needs a high number of random directions.

In hashing, and especially in matching the latent space of b to the target, discrete prior , each sample z of lies at the vertices of the hypercube. Without loss of generality, assume that z is two dimensional, therefore in (this is similar to sampling where the only difference is the activation function tanh instead of sigmoid after the logit output of the encoder f ). A discrete sample z falls into one of the four possible corners, as seen in Figures 2a and 2c. Figures 2b and 2d show the one-dimensional Wasserstein distances for different directions at different angles from the vector (1, 0). In Figure 2b, when the generated data points b are further away from the corners, the projections onto the directions along the axes (those with angles 45for different integer values r) have the most distances, thus best describe the separation between the samples of b and z. The projections onto other directions will underestimate this separation between the samples of z and b. In Figure 2d, when more data points are closer to the

where and . Solving the OT in the one-dimensional space has a significantly small computational cost [8]. The cost of such operation is equivalent to the one-dimensional array sort plus the distance calculation. Therefore, the Wasserstein-p distance can be calculated as follows:

where is the sorting operation applied to dimension and are the ranked values of the sets and , respectively. The cost of solving the proposed Wasserstein distance estimate is . Since m is fixed and smaller than N, this is an order of magnitude faster than the high-dimensional OT’s computational cost of . We can also show that the proposed calculation is a valid distance measure and its sample complexity is polynomial.

Theorem 3.1. The proposed Wasserstein-p calculation is a valid distance and it has a polynomial sample complexity.

Proof. Let be the one-hot vector whose component k is 1. We can see that and are projections of the data matrices onto each dimension k. By setting , this is exactly the formulation of SWD. Thus, by a similar proof in [7], it is trivial to show that the proposed Wasserstein estimate is a valid distance and has a polynomial complexity.

Unlike SWD which employs a large number of random directions, the proposed estimate uses the directions that best separate the generated b and the real data z. Similar to SWD, estimating the Wasserstein distance from the m one-dimensional projections has a polynomial sample complexity. This is an important advantage over the OT estimation, which has an exponential sample complexity.

The proposed estimate is also related to Max-SWD [7]. MaxSWD finds the single best projected dimension that best describes the separation of the two samples, by employing the discriminator that classifies real and fake data points. This results in the problematic minimax game. Our proposed calculation can estimate the similar averaged distance without the discriminator. For example, in Figure 2b, we can show that, given the optimal discriminator, Max-SWD finds the single direction whose distance would be the average of the distances along the 0-direction and 90-direction. Our approach will also estimate the same average, but without using the discriminator.

, , Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, and Chandan K. Reddy

Sample a minibatch of Compute the feature vectors j = {1, ..., N }. Sample

We call the adversarial autoencoder with the proposed loss calculation Discrete Component-wise Wasserstein AutoEncoder (DCWAE). The objective function of the DCW-AE can be written as follows:

We summarize the training algorithm of DCW-AE in Algorithm 1. While we can have a single minimization step on L, we find that alternatively minimizing works better in practice. (similar to Alternating Least Squares approach traditionally used in matrix factorization). Specifically, we minimize for a few steps l on every minimization step of . In all of our experiments, we set l = 5. Note that this is not a minimax game because different losses and are not related through a divergence or a value function, as in WGAN [1] and Jensen-Shannon GAN [17].

4 EXPERIMENTS

In this section, we present the experimental results to demonstrate the effectiveness of the proposed hashing method over the existing adversarial autoencoders and other hashing methods.

4.1 Datasets Used

We utilize the following datasets in our performance evaluation experiments:

• : This dataset consists of of 70,000 digit images. We randomly select 10,000 images as the query set and the remaining images for the training and retrieval sets.

• CIFAR10 [24]: This dataset consists of 60,000 natural images categorized uniformly into 10 labels. We randomly select 1,000 images from each label for the query set and use the remaining images for the training and retrieval sets. Hence, the query set contains 10,000 images and the training/retrieval set contains the same 50,000 images.

• FLICKR25K [3]: This dataset consists of 25,000 social photographic images downloaded from Flickr 4. There are a total of 250 different class labels. We randomly select 20 images from each label for the query set and similarly use the remaining images for the training and retrieval sets. The final query dataset contains 5,000 images and the training/retrieval set contains the same 20,000 images.

• : This dataset consists of 1.8 millions of scenery images organized into 365 categories (labels). We randomly select 10 images from each label for the query set and 500 images from each label for the trainin and retrieval sets. The final query dataset contains 3,650 images and the training/retrieval set contains 182,500 images.

4.2 Evaluation Metrics

For evaluating the performance of the proposed model, we follow the standard evaluation mechanism that is widely accepted for the problem of image hashing - the precision@R (P@R) and mean average precision (MAP). Given the query images, P@R and MAP are calculated as follows:

where N is the size of the retrieval set, R is the number of retrieved images, is the number of all relevant images in this set, Q is the size of the query set and 1 only when the r-th retrieved image is relevant to the query image q; otherwise 0. A retrieved image is relevant if its ground-truth label is the same as the label of the query image.

Table 2: Performance comparison of different methods using P@1000. The best P@1000 value for each experiment is in bold.

4.3 Comparison Methods

We compare the performance of the proposed method with various representative unsupervised image hashing methods.

• Locality Sensitive Hashing (LSH) [25]: a widely-used data-independent, shallow hashing method using random projection.

• Spectral Hashing (SpecHash) [33]: an unsupervised shallow hashing method whose goal is to preserve locality while finding balanced, uncorrelated hashes by solving the Eigenvector problem.

• Iterative Quantization (ITQ) [16]: the state-of-the-art shallow hashing method that alternately minimizes the quantization error to achieve better hash codes.

• Stochastic Generative Hashing (SGH) [6]: a representative hashing method that, similar to the proposed method, also minimizes the reconstruct loss in an autoencoder model.

• Semantic Structure-based Deep Hashing (SSDH) [35]: an unsupervised deep hashing method that learns the hash function by preserving heuristically-defined semantic structure of the data. The semantic structure is extracted from a pre-trained neural network (such as VGGNet).

• Deep Hashing by Distilling Data Pairs (DistillHash) [35]: the state-of-the-art unsupervised deep hashing method that is, in principle, similar to SSDH. However, the semantic structure is constructed by distilling data pairs that are consistent with the Bayes optimal classifier.

• Wasserstein Adversarial Autoencoder (WGAN-AE): adversarial autoencoder model for hashing which employs the critic that estimates the Wasserstein from the dual domain. This is an improved version of the adversarial autoencoder defined in [12].

• OT-Wasserstain Adversarial Autoencoder (OT-AE): the adversarial autoencoder model for hashing which directly minimizes the Wasserstein distance using the OT formulation in the primal domain, (discussed in Section 3.4).

• The proposed method (DCW-AE)

Implementation Details: For the existing shallow hashing techniques, we employ a pre-trained VGG network and extract the pooling-fc7 feature vectors of the images [31]. For our model, we employ VGG for the feature extractor h on CIFAR10, FLICKR25K and PLACE365. For MNIST, we do not use the feature extractor h and directly learn to reconstruct the pixel image, i.e. .

Table 3: Performance comparison of different methods us- ing MAP form = 64. The best MAP values are shown in bold.

The encoder/decoder are multi-layer perceptrons (MLP) with hidden layers as 1000 1000 500 and 500 1000 1000, respectively. We implement our method in pyTorch 6 and train our model using Stochastic Gradient Descent (SGD) along with Adam optimizer [23]. We use a mini-batch size of 128 examples for DCW-AE and WGAN-AE. For OT-AE, we try different mini-batch sizes ranging from 128 to 512. For a fair and strict evaluation, we perform a grid search to find the best hyper-parameters for each of the methods; and report averaged results over five runs (three runs for PLACE365 dataset). The source code and the datasets used in our experiments will be made publicly available on a Github repository upon the acceptance of this paper.

4.4 Performance Results

In this experiment, we measure the performance of various methods for the image retrieval task. Table 2 shows the P@1000 results across different lengths of the hash codes. DCW-AE consistently outperforms all the baseline methods at different lengths of the hash codes. Specifically, DCW-AE has a relative performance improvement of more than 10% in CIFAR10 and FLICKR25K. Similarly, in Table 3, we report the MAP results for all the methods at different lengths of the hash codes. Again, DCW-AE consistently achieves the best MAP results. The improvements of our method over the baselines are statistically significant according to the corresponding paired t-tests (p-value < 0.01).

, , Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, and Chandan K. Reddy

Figure 3: Performance comparison of various methods using Precision-Recall curves of code lengths of 32 and 64 bits.

Table 4: An illustration of the top-10 retrieved MNIST digits for a given query image using code length (m) of 32 and 16 bits.

The P@1000 and MAP results demonstrate the superiority of our method compared to various state-of-the-art approaches for the image hashing problem. One important result is the improvement in performance of DCW-AE compared to OT-AE. This supports our claim that the existing adversarial autoencoders cannot learn the optimal hash function compared to our DCW-AE model. This demonstrates the significance of a generalizable Wasserstein estimate. A lower sample-complexity estimate makes it easier for the algorithm to converge to the target distribution using mini-batch training algorithms.

4.5 Ablation Study

We further evaluate the effectiveness of the proposed adversarial learning procedure using an ablation study. Figure 3 shows the Precision-Recall curves of the different Autoencoder-based hashing models. AE denotes the vanilla Autoencoder without any adversarial regularization. SWD-AE is the Adversarial Autoencoder whose adversarial matching cost is SWD. We observe that all adversarialbased Autoencoder models outperform AE. This demonstrates the importance of the adversarial learning for Autoencoders in image hashing. Furthermore, replacing the dual Wasserstein estimate (in WGAN-AE) and the OT estimate (in OT-AE) with our proposed

Figure 4: T-SNE embedding of the generated discrete hash codes on the MNIST dataset. The digit identities are color-coded.

estimate further improves the retrieval performance. DCW-AE’s performance is significantly better than that of OT-AE and SWDAE. We hypothesize that this improvement is primarily due to the following reasons:

• The proposed distance estimate of DCW-AE has a better sample complexity (i.e., more generalizable) than OT. Better sample complexity allows mini-batch optimization techniques such as SGD to converge better to the target distribution.

• The proposed distance estimate of DCW-AE is a better distance compared to SWD. Our distance estimate induces a weaker topology than SWD [1]. In other words, our estimate makes it easier to converge to the target distribution.

4.6 Qualitative Evaluation

In Table 4, we show the top-10 retrieved digit images of two query images corresponding to digits 3 and 9. DCW-AE method has successfully retrieved relevant digits; when the retrieved images are false positives, we can still see that they contains similar appearances (e.g. some handwritten digit 4’s are similar to the digit 9). As expected, when increasing the size of the binary code) (m), the model makes fewer mistakes.

Qualitatively, we can visually compare the quality of the hash codes generated by DCW-AE, OT-AE and two best non-adversarial hashing baselines, ITQ and DistillHash. Figure 4 shows the twodimensional t-SNE embeddings [27] of the generated hash codes on the query set. In this example, the similarity matrix of SSDH is constructed directly from the image-pixel space. Notice that SSDH generates unreliable hash codes. This shows that, without a reliable construction of the similarity matrix, the retrieval performance of both SSDH and DistillHash is significantly deteriorated. On the other hand, DCW-AE learns a very efficient discrete embedding of the original data; for example, DCW-AE can even separate the most similar digits 9 and 4.

4.7 DCW-AE’s Wasserstein Estimation

In this Section, we show the efficiency of the proposed Wassertein estimate and compare it with SWD. Figure 5 shows the distance estimates and the gradient norm of the parameters (training of the proposed calculation in DCW-AE and of the original SWD on the CIFAR-10 dataset. The SWD is estimated with different

Figure 5: Wasserstein distance estimates (left) and the gradient norm of the encoder’s parameters (right) during training of DCW-AE and SWD (with varying

number of random projections . On the extreme case, when 1, both the distance and the gradient fluctuates siginificantly during training. This makes adversarial training become very unstable. When is higher, we can observe that the distance estimates of SWD are lower. This is because the one-dimensional Wasserstein distances across several random directions are very small and do not contain useful signal for training (see the corresponding gradient). Even when increasing from 10 to 10K, the estimate is not generally better. On the other hand, the estimates of DCW-AE are higher and its gradient is also more stable. This is due to the fact that the proposed distance estimate averages the distances of directions along which the are most dissimilar.

4.8 Computational efficiency of DCW-AE

In this experiment, we compare the training time of the proposed method and the existing adversarial hashing methods. The training time of WGAN-AE, OT-AE and DCW-AE are shown in Figure 6 for three datasets, CIFAR10, FLICKR25K and PLACE365. We report the average training time per epoch. In Figure 6, the training time of DCW-AE is significantly reduced compared to the training of OT-AE.

, , Khoa D. Doan, Saurav Manchanda, Sarkhan Badirli, and Chandan K. Reddy

Figure 6: Comparison of training times (in minutes) per one epoch on various datasets (using a mini-batch size of 128 examples).

5 CONCLUSION

We proposed a novel adversarial autoencoder model for the image hashing problem. Our model learns hash codes that preserves the locality information in the original data. Our model has a much better generalization property than the existing adversarial approaches, thus is able to achieve significant performance gains. Furthermore, the proposed model trains significantly faster than the existing Wasserstein-based adversarial autoencoders. Our experiments validate that the proposed hashing method outperforms all the existing state-of-the-art image hashing methods. Our work makes one leap towards leveraging an efficient, robust adversarial autoencoder for the image hashing problem and we envision that our model will serve as a motivation for improving other adversarially-trained hashing models. The code accompanying this paper is available on Github7.

REFERENCES

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875 (2017).

[2] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. 2017. Generalization and equilibrium in generative adversarial nets (gans). In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 224–232.

[3] Tu Bui, Leonardo Ribeiro, Moacir Ponti, and John Collomosse. 2018. Sketching out the details: Sketch-based image retrieval using convolutional neural networks with multi-stage regression. Computers & Graphics 71 (2018), 77–87.

[4] Rainer E Burkard, Mauro Dell’Amico, and Silvano Martello. 2009. Assignment problems. Springer.

[5] Yue Cao, Bin Liu, Mingsheng Long, and Jianmin Wang. 2018. Hashgan: Deep learning to hash with pair conditional Wasserstein GAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1287–1296.

[6] Bo Dai, Ruiqi Guo, Sanjiv Kumar, Niao He, and Le Song. 2017. Stochastic generative hashing. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 913–922.

[7] Ishan Deshpande, Yuan-Ting Hu, Ruoyu Sun, Ayis Pyrros, Nasir Siddiqui, Sanmi Koyejo, Zhizhen Zhao, David Forsyth, and Alexander G Schwing. 2019. MaxSliced Wasserstein Distance and its use for GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 10648–10656.

[8] Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. 2018. Generative modeling using the Sliced Wasserstein Distance. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3483–3491.

[9] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi Nourabadi, Yanhua Yang, Cheng Deng, and Heng Huang. 2018. Unsupervised Deep Generative Adversarial Hashing Network. In CVPR 2018.

[10] Thanh-Toan Do, Anh-Dzung Doan, and Ngai-Man Cheung. 2016. Learning to hash with binary deep neural network. In European Conference on Computer Vision. Springer, 219–234.

[11] Khoa D Doan and Chandan K Reddy. 2020. Efficient Implicit Unsupervised Text Hashing using Adversarial Autoencoder. In Proceedings of The Web Conference 2020. 684–694.

[12] Khoa D Doan, Pranjul Yadav, and Chandan K Reddy. 2019. Adversarial Factorization Autoencoder for Look-alike Modeling. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2803–2812.

[13] Tiezheng Ge, Kaiming He, and Jian Sun. 2014. Graph cuts for supervised binary coding. In European Conference on Computer Vision. Springer, 250–264.

[14] Kamran Ghasedi Dizaji, Feng Zheng, Najmeh Sadoughi, Yanhua Yang, Cheng Deng, and Heng Huang. 2018. Unsupervised deep generative adversarial hashing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3664–3673.

[15] Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In Vldb, Vol. 99. 518–529.

[16] Yunchao Gong, Svetlana Lazebnik, Albert Gordo, and Florent Perronnin. 2013. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2916–2929.

[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems. 2672–2680.

[18] Kaiming He, Fang Wen, and Jian Sun. 2013. K-means hashing: An affinitypreserving quantization method for learning binary compact codes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2938–2945.

[19] Jae-Pil Heo, Youngwoon Lee, Junfeng He, Shih-Fu Chang, and Sung-Eui Yoon. 2012. Spherical hashing. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2957–2964.

[20] Chen Huang, Chen Change Loy, and Xiaoou Tang. 2016. Unsupervised learning of discriminative attributes and visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5175–5184.

[21] Shanshan Huang, Yichao Xiong, Ya Zhang, and Jia Wang. 2017. Unsupervised Triplet Hashing for Fast Image Retrieval. In Proceedings of the on Thematic Workshops of ACM Multimedia 2017. ACM, 84–92.

[22] Akihiro Iohara, Takahito Ogawa, and Toshiyuki Tanaka. 2019. Generative model based on minimizing exact empirical Wasserstein distance. https://openreview. net/forum?id=BJgTZ3C5FX

[23] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[24] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).

[25] Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman. 2014. Mining of massive datasets. Cambridge university press.

[26] Kevin Lin, Jiwen Lu, Chu-Song Chen, and Jie Zhou. 2016. Learning compact binary descriptors with unsupervised deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1183–1192.

[27] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9, Nov (2008), 2579–2605.

[28] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015).

[29] Ruslan Salakhutdinov and Geoffrey Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning 50, 7 (2009), 969–978.

[30] Fumin Shen, Chunhua Shen, Wei Liu, and Heng Tao Shen. 2015. Supervised discrete hashing. In Proceedings of the IEEE conference on computer vision and pattern recognition. 37–45.

[31] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[32] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. 2017. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558 (2017).

[33] Yair Weiss, Antonio Torralba, and Rob Fergus. 2009. Spectral hashing. In Advances in neural information processing systems. 1753–1760.

[34] Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. 2014. Supervised hashing for image retrieval via image representation learning.. In AAAI, Vol. 1. 2.

[35] Erkun Yang, Cheng Deng, Tongliang Liu, Wei Liu, and Dacheng Tao. 2018. Semantic structure-based unsupervised deep hashing. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. 1064–1070.

[36] Erkun Yang, Tongliang Liu, Cheng Deng, Wei Liu, and Dacheng Tao. 2019. Distillhash: Unsupervised deep hashing by distilling data pairs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2946–2955.

[37] Huei-Fang Yang, Kevin Lin, and Chu-Song Chen. 2018. Supervised learning of semantics-preserving hash via deep convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 40, 2 (2018), 437–451.