We have been witnessing a continuing proliferation of new datasets, largely spurred by the record-breaking success of deep neural networks and the ubiquity of data-generation and sharing tools. Despite this abundance, having the right dataset for a target application is not guaranteed in practice. The performance of a machine learning model depends largely on the availability of a relevant, adequate, and balanced dataset that represents the distribution of the application-specific sample space well. However, real-world applications quite often come with small or poorly organized data of their own. Certain practices are commonly exercised, such as transferring from a model pretrained with another dataset or augmenting the given training data with samples from other datasets (Bengio, 2011; Ciresan et al., 2010; Dai Wenyuan et al., 2007; Ge & Yu, 2017; Perez & Wang, 2017; Seltzer et al., 2013; Yosinski et al., 2014). Still, it is often difficult to foresee how compatible any of the available known datasets is with the target sample space.
In this paper, we present SimEx, a new method for early prediction of inter-dataset similarity. Specifically, SimEx takes unknown data samples as input to a set of autoencoders each of which was pretrained to reconstruct a specific, distinct part of known data. Then, the differences between the input samples and the reconstructed output samples are evaluated. Our intuition is that, the more underlying similarity the unknown data samples share with the specific part of known data that an autoencoder was trained with, the better chances there could be that this autoencoder makes use of its learned knowledge, reconstructing output samples closer to the originals. Here, the differences between the original inputs and the reconstructed outputs constitute a relative indicator of similarity between the unknown data samples and the specific part of known data.
SimEx implies a number of practical benefits. Empirical evidence supports that the similarity between data is correlated with the effectiveness of transferring a pretrained network (Yosinski et al., 2014) or supplementing the training samples from a relevant dataset (Dai Wenyuan et al., 2007; Ge & Yu, 2017). Beyond this fundamental benefit, the way SimEx predicts data similarity could lead to further advantages. It is likely that SimEx takes account of self-organized deep features of the given data, beyond sample-space measures such as pixel-level distribution (Fuglede & Topsoe, 2004; Kullback & Leibler, 1951) or structure-level similarity (Oliva & Torralba, 2001; Wang et al., 2004). Notably, despite the potential use of deep features, SimEx does not measure a difference metric directly in the latent space; it takes the measurement back to the sample space where the reconstructed output appears. An implication of this property is that it may help alleviate possible biases toward the particular way that a model represents the deep features. Another benefit comes from the systems perspective. At comparison time, predicting the similarity with respect to an unknown dataset requires no further training, because the autoencoders were pretrained. This property saves a considerable amount of runtime resources at comparison time, unlike the existing practice of inferring the relevance between datasets by transferring a network from one dataset to another and measuring the resulting performance deviation (Yosinski et al., 2014). Our experiments show that SimEx is more than 10 times faster than the existing transfer learning-based methods at runtime when predicting the similarity between an unknown dataset and reference datasets.
We note that the term ‘similarity’ is not a single rigorously defined measure. Various similarity measures have been defined, each with a target-specific focus (Larsen et al., 2016). Yet, we believe that a similarity measure making few assumptions about the data or task would be favorable for wide usability. In this paper, we do not claim that SimEx reflects a ‘true’ similarity of any kind. Instead, this paper focuses on experimental exploration of the usability and benefits of SimEx-predicted similarity in the context of typical transfer learning and data augmentation.
Our contributions are threefold. First, we present a new method for predicting the similarity between data, which is essentially: have a set of autoencoders learn about known data, represent unknown data with respect to those learned models, reconstruct the unknown data from that representation, and measure the reconstruction errors in the sample space, which are considered an indicator of the relative similarity of the unknown data with respect to the known data. Second, we apply our method to three cases of similarity prediction: the inter-dataset similarity, the inter-class similarity across heterogeneous datasets, and the inter-class similarity within a single dataset. Third, we demonstrate the clear speed advantage and potential usability of our method in making informed decisions in three practical problems: transferring pretrained networks, augmenting a small dataset for a classification task, and estimating the inter-class confusion levels in a labeled dataset.
Quantifying the similarity between two different datasets is a well-studied topic in machine learning. A theoretical abstraction of data similarity can be cast to the classic KL-divergence (Kullback & Leibler, 1951). For ‘shallow’ datasets, empirical metrics such as Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006) are popular choices. For images, the structural similarity metric (SSIM) is a well-known metric taking account of luminance, contrast, and structural information (Wang et al., 2004). However, for learning with high-dimensional data such as complex visual applications, it is challenging to directly apply these shallow methods.
A related topic is domain adaptation, in which a model trained on the dataset in the source domain is extended to carry out the task on the data in the target domain. For shallow features, techniques such as Geodesic Flow Sampling (GFS) (Gopalan et al., 2013), Geodesic Flow Kernel (GFK) (Gong et al., 2012), and Reproducing Kernel Hilbert Space (RKHS) (Ni et al., 2013) are well-established. Deep methods have been proposed recently, such as Domain Adversarial Neural Networks (DANN) (Ganin et al., 2016), Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017) and Deep Adaptation Network (DAN) (Long et al., 2015). However, the goal of domain adaptation is fundamentally different from that of data similarity. Our objective is to directly compute the data similarity without being tied to a specific machine learning method or a specific neural network architecture.
A problem space relevant to data similarity is to improve the output image quality of generative models, such as by having the models incorporate structural knowledge (Snell et al., 2017; Yan et al., 2016) or perform intelligent sharpening (Mansimov et al., 2015). For SimEx, however, more accurate reconstruction of a source sample is preferable but not imperative. The primary interest of SimEx is to discern the relative differences of reconstruction quality between input samples. For this purpose, modest reconstruction quality is still acceptable and may even be favored considering the extensive system resources likely to be consumed to achieve high-quality reconstruction.
In this section, we list a few practical scenarios where our similarity prediction using SimEx could be potentially beneficial in solving some real-world problems. In each scenario, SimEx is used in slightly different ways, but under the same methodology.
3.1 Inter-dataset similarity
Transferring from an existing classifier pretrained with a relevant dataset is frequently exercised to obtain a good classifier for a new dataset, especially when the new dataset is small. Conversely, the accuracy deviation of the transferred model is often an implicit indicator of the similarity of the new dataset with respect to the known dataset (Yosinski et al., 2014).
Suppose a service provider trains classifiers for small datasets coming from a number of customers. The service provider would possess in its library a set of classifiers pretrained from various datasets. Upon receiving a customer dataset, transfer learning from one of the pretrained classifiers can be performed to learn a new classifier in a short amount of time and from a small number of samples.
However, choosing the right model to transfer from is a non-trivial problem. Intuitively, it makes sense to choose the dataset that is the most ‘similar’ to the target dataset. The best practice is to try all the classifiers in the library and choose the best, at the expense of the high computing/time cost each time a transfer learning is performed.
SimEx-predicted similarity between the datasets can be used as a proxy for which classifier will be the best for transfer learning. Later in this paper, we will show that SimEx can achieve a meaningfully consistent result compared to the actual transferred quality. At the same time, SimEx achieves more than an order of magnitude lower runtime latency since it does not involve training.
3.2 Inter-class similarity across heterogeneous datasets
It is a common data augmentation strategy to supplement a given set of data with relevant existing, deformed, or synthesized samples (Ciresan et al., 2010; Perez & Wang, 2017; Seltzer et al., 2013). If the task is classification, it is logical to supplement each class with samples that are highly relevant to that class. Suppose that we have many ‘reference’ datasets already labeled, potentially some of which might be relevant to a new target classification problem concerning a new, insufficient dataset.
Hence it is crucial to answer the following questions: which existing dataset is most beneficial to strengthen the target classification model? and further, which classes of the existing datasets are the most beneficial to each target class?
We believe this is where SimEx can provide some information, by predicting similarities between classes among different datasets. For example, if we have a ‘food’ class in the target dataset, we can check whether a ‘flower’ class or a ‘fish’ class is better for supplementing it. Again, SimEx saves the huge runtime cost of a trial-and-error method, where each target class is supplemented with an arbitrary or hand-picked reference class and the accuracy is checked after the training.
3.3 Inter-class similarity within a single dataset
Classification is a very common kind of machine learning problem. Many conventional applications utilizing classifiers are interested only in the single class label with the highest softmax output. However, it is known that classifiers are not equally confident about every class, but exhibit varying levels of confusion between different input-output class pairs (Delahunt et al., 2019).
Knowing potential inter-class confusion in advance would bring practical benefits in real-world problems. Real datasets may have ill-labeled samples in some classes. Some classes may even be suboptimally separated in the first place. For example, a dataset of fruit classes {apple, banana, kiwi, clementine, tangerine} would yield higher confusion between the two citrus classes. One may also want to know potential inter-class confusion when a new class has been introduced to an existing classifier. Thus, when training a classifier with a dataset of unknown characteristics, early screening of potential inter-class confusion may help in taking informed actions, e.g., re-clustering problematic classes, before training the classifier with the given dataset as-is. It could save considerable time and resources otherwise spent on trial and error.
We can hypothesize that classifiers would suffer from more confusion among ‘similar’ classes, and SimEx can give a rough forecast on it by predicting the similarities between the classes.
Throughout this paper, we use the notation $A(X)$ to denote an autoencoder pretrained with the samples of a set $X$. Suppose that we have a set of known data samples $Y$ readily available, which is a union of smaller disjoint parts of data: $Y = Y_1 \cup Y_2 \cup \dots \cup Y_n$. In practice, $Y$ could be a union of independent datasets $Y_1, \dots, Y_n$, or without loss of generality, $Y$ could be a labeled dataset consisting of classes $Y_1, \dots, Y_n$. Our methods utilize a set of autoencoders $\{A(Y_1), \dots, A(Y_n)\}$. Each $A(Y_i)$ is specialized in reconstructing a specific part of the data space (i.e., $Y_i$ out of the whole $Y$).
In this section, we present the key properties of our autoencoders, and their implications in predicting the similarity order of multiple sets of data with respect to a reference set of data. Figure 1a illustrates examples from an autoencoder trained to reconstruct the samples of digit 4. For each row, the left shows the input samples and the right shows the corresponding output samples. The topmost row demonstrates the results from the input samples of digit 4 taken from the test set. Not surprisingly, each output sample resembles its input sample very closely. The 2nd and 3rd rows show the examples with digit 7 and 9, respectively; their reconstructions look slightly degraded compared to the examples with digit 4, but still look fairly close. In contrast, the 4th and 5th rows, showing the examples with digit 3 and 5, exhibit severely degraded results in both samples. Here we can hypothesize that the knowledge learned from reconstructing 4 has been more useful for reconstructing 7 and 9 than for 3 and 5, and that 4 could be thought of as closer to 7 and 9.
Figure 1: Layer specifications, operation diagrams, and reconstruction examples.
Figure 1b illustrates the diagrams of pretraining and performing comparisons with SimEx. $A(Y_i)$ is pretrained with $Y_i$, which is a part of the reference data $Y$. An input sample $x$ from an unknown set $X$ is given to $A(Y_i)$, which reconstructs an output sample $\hat{x}$. Here, we evaluate $\Delta(x, \hat{x})$, i.e., the difference between the input $x$ and the output $\hat{x}$. There could be various choices of the function $\Delta$, although this paper applies the same loss function used during training $A(Y_i)$, which is either MSE or iSSIM. $\Delta_{Y_i}(X)$ denotes the mean of all $\Delta(x, \hat{x})$ resulting from $x \in X$. We conjecture that $\Delta_{Y_i}(X)$ would be a predictor of a similarity metric of the unknown set $X$ with respect to the reference set $Y_i$, which grows as $X$ is more dissimilar from $Y_i$. Furthermore, for multiple unknown sets of data X, W, ... , Z, we conjecture that $\Delta_{Y_i}(X)$, $\Delta_{Y_i}(W)$, ... , $\Delta_{Y_i}(Z)$ would predict the ordered similarity of X, W, ... , Z with respect to $Y_i$. For example, applying this to the results in Figure 1a yields an ordered list of the digits in increasing order of dissimilarity with respect to the set of digit 4.
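The comparison step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pretrained autoencoder is abstracted as any callable that maps a sample to its reconstruction, samples are flat lists of floats, and all names (`mse`, `mean_reconstruction_loss`, `rank_by_dissimilarity`, `toy_autoencoder`) are hypothetical and not taken from the paper's implementation.

```python
from typing import Callable, Dict, List, Sequence

Sample = Sequence[float]


def mse(x: Sample, x_hat: Sample) -> float:
    """Per-sample mean squared error between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def mean_reconstruction_loss(
    reconstruct: Callable[[Sample], Sample],
    samples: List[Sample],
    loss: Callable[[Sample, Sample], float] = mse,
) -> float:
    """Average loss(x, reconstruct(x)) over all samples of one unknown set."""
    losses = [loss(x, reconstruct(x)) for x in samples]
    return sum(losses) / len(losses)


def rank_by_dissimilarity(
    reconstruct: Callable[[Sample], Sample],
    unknown_sets: Dict[str, List[Sample]],
) -> List[str]:
    """Order the unknown sets by increasing mean reconstruction loss."""
    scores = {
        name: mean_reconstruction_loss(reconstruct, samples)
        for name, samples in unknown_sets.items()
    }
    return sorted(scores, key=lambda name: scores[name])


# Toy stand-in for a pretrained autoencoder (an assumption for illustration):
# it "knows" a single prototype and pulls every input halfway towards it.
PROTOTYPE = [1.0, 0.0, 1.0, 0.0]


def toy_autoencoder(x: Sample) -> Sample:
    return [0.5 * (a + p) for a, p in zip(x, PROTOTYPE)]


unknown_sets = {
    "far": [[0.0, 1.0, 0.0, 1.0], [0.1, 0.9, 0.2, 1.0]],
    "near": [[0.9, 0.1, 1.0, 0.0], [1.0, 0.0, 0.8, 0.1]],
}
ranking = rank_by_dissimilarity(toy_autoencoder, unknown_sets)
print(ranking)  # sets listed from most to least similar to the prototype
```

No training happens at comparison time in this sketch; each unknown set costs one forward pass per sample, which is the property the runtime argument relies on.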
We conjecture that the mean reconstruction loss $\Delta_Y(X)$ may reflect not only the apparent similarity in the sample space but also certain ‘deep criteria’ based on the knowledge of Y that A(Y) learned at pretraining. If the unknown dataset X embeds more deep features compatible with those of Y, we may observe a smaller $\Delta_Y(X)$ from A(Y). Note that, unlike task-specific supervised models, the deep features learned and extracted in SimEx are task-agnostic. However, we acknowledge that these are as yet hypothetical and not straightforward to verify. The results shown in Figure 1a suggest a possible order of digits in terms of their similarities to the digit 4. Although this particular order likely concurs with the visually perceived similarity between digits, reasoning about such an ordering may not always be obvious.
For the rest of the paper, we demonstrate that the similarity relation predicted by SimEx could be a useful indicator applicable to the typical context of transfer learning and data augmentation. We also discuss the runtime advantages that SimEx implies in making an informed data-selection decision.
In this section, we explore the potential usefulness of SimEx through a series of experiments in which SimEx is applied to predict the ordered similarity relationship among datasets and among classes. We juxtapose our results with the similarity relationship from typical baseline methods, and find the correlations in between. Specifically, we experiment with five publicly available MNIST-variant datasets: MNIST (LeCun et al., 1998) (denoted by MNIST or M hereafter), rotated MNIST (ROTATED or R) (Larochelle et al., 2007), background-image MNIST (BGROUND or B) (Larochelle et al., 2007), fashion MNIST (FASHION or F) (Xiao et al., 2017), and EMNIST-Letters (EMNIST or E) (Cohen et al., 2017). E consists of 26 classes of the English alphabet, while the other datasets have 10 classes each. Figure 3 illustrates the datasets.
Figure 2: Layer specifications of the autoencoders used in this paper.
The autoencoders used in this paper have a symmetric architecture whose encoder part is largely adopted from LeNet-5 (LeCun et al., 1998), while SimEx is open to other autoencoder models. Figure 2 lists the layer specifications. For loss functions, we used the mean squared error (denoted MSE hereafter) or the inverted structural similarity metric (denoted iSSIM hereafter, i.e., 1 − SSIM) (Wang et al., 2004), while SimEx is open to other choices of loss function. We conducted our experiments on machines with one Intel i7-6900K CPU at 3.20 GHz and four NVIDIA GTX 1080 Ti GPUs.
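To make the two loss choices concrete, the following is a minimal sketch that assumes iSSIM is computed as 1 − SSIM. As a simplification that is not the authors' implementation, SSIM is evaluated here once over the whole image from global statistics, whereas Wang et al. (2004) average it over local windows. Images are flat lists of pixel intensities in [0, 1], and the names `global_ssim` and `issim` are hypothetical.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def global_ssim(x: Sequence[float], y: Sequence[float],
                dynamic_range: float = 1.0) -> float:
    """SSIM from whole-image statistics (single window; a simplification)."""
    c1 = (0.01 * dynamic_range) ** 2  # stabilizer for the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizer for the contrast/structure term
    mu_x, mu_y = _mean(x), _mean(y)
    var_x = _mean([(a - mu_x) ** 2 for a in x])
    var_y = _mean([(b - mu_y) ** 2 for b in y])
    cov_xy = _mean([(a - mu_x) * (b - mu_y) for a, b in zip(x, y)])
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def issim(x: Sequence[float], y: Sequence[float],
          dynamic_range: float = 1.0) -> float:
    """Inverted SSIM, assumed here to be 1 - SSIM: 0 for identical images."""
    return 1.0 - global_ssim(x, y, dynamic_range)


image = [0.0, 0.5, 1.0, 0.5]
inverted = [1.0 - p for p in image]
print(issim(image, image))     # close to 0 for a perfect reconstruction
print(issim(image, inverted))  # large for a structurally opposite image
```

Like MSE, this quantity grows as the reconstruction degrades, so either can serve as the per-sample difference that is averaged over an unknown set.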
Figure 3: MNIST-variant datasets used in the experiments.
5.1 Predicting inter-dataset similarity
In this experiment, we demonstrate the quality of the SimEx-predicted inter-dataset similarities as explained in Section 3. We train 5 autoencoders, one for each of our datasets. Then each autoencoder is given the samples from the other 4 datasets, and the resulting mean reconstruction loss $\Delta_Y(X)$ is computed. We conducted the same experiments with MSE and iSSIM losses for SimEx. Figure 4a and 4b depict the relative $\Delta_Y(X)$ levels from SimEx with MSE and iSSIM loss functions, respectively.
For the baseline, we trained five 10-class classifiers using the LeNet-5 model, one for each dataset. For EMNIST, we used only the first 10 classes (A through J). After each base network was trained, we froze all the convolution layers of the base networks and retrained each network’s FC layers with the four other datasets, resulting in (5 base networks) × (4 different datasets per base dataset) = 20 retrained networks. Figure 4c depicts the accuracy of the retrained networks normalized by the original accuracy of the base network. We balanced the number of samples across classes and across datasets. Each dataset was divided 5:1 for training and testing. We normalized the color charts for visibility, as only the relative differences matter below.
Table 1: Spearman’s $\rho$ between SimEx-predicted orders and retrained accuracy orders
To assess the consistency between the SimEx-predicted similarity relationship and the baseline retraining-implied similarity relationship, we compare the ordered lists of datasets from both methods: one ordered by the mean reconstruction losses $\Delta_Y(X)$ at testing from each A(Y), and the other ordered by the retrained accuracy transferred from each base network (denoted by B(Y)), where Y denotes the base dataset that A(Y) and B(Y) were previously trained with. For example, A(M) with MSE predicts the loss-ordered list of (M, R, E, F, B) (shown in Figure 4a), while retraining B(M) gives the accuracy-ordered list of (M, E, R, F, B) as shown in Figure 4c.
Table 1 lists Spearman’s rank correlation coefficients ($\rho$) from each pair of SimEx-predicted and retrain-implied similarity orderings, both sharing the same base dataset. Spearman’s $\rho$ is a popular measure of the correlation between ordinal variables, ranging over [-1, 1], where 1 means a pair of identically ordered lists and -1 means fully reversed lists. The table shows reasonably high correlations. The results indicate that using the iSSIM loss function in SimEx yields better correlations, possibly due to its higher robustness in taking account of the structural similarity.
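As a worked illustration of the measure, the sketch below computes Spearman's rank correlation for two orderings of the same items using the standard tie-free formula $\rho = 1 - 6\sum d^2 / (n(n^2 - 1))$, where $d$ is the rank difference of each item. The function name `spearman_rho` is hypothetical; the two input lists are the example orderings quoted in the text for A(M) with MSE and for retraining B(M).

```python
from typing import List, Sequence


def spearman_rho(order_a: Sequence[str], order_b: Sequence[str]) -> float:
    """Spearman's rho between two orderings of the same items (no ties)."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    rank_in_b = {item: rank for rank, item in enumerate(order_b)}
    # Sum of squared rank differences over all items.
    d_squared = sum(
        (rank - rank_in_b[item]) ** 2 for rank, item in enumerate(order_a)
    )
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


simex_order: List[str] = ["M", "R", "E", "F", "B"]    # loss-ordered, A(M) with MSE
retrain_order: List[str] = ["M", "E", "R", "F", "B"]  # accuracy-ordered, B(M)

rho = spearman_rho(simex_order, retrain_order)
print(round(rho, 3))  # 0.9
```

Only R and E swap places between the two lists, so the sum of squared rank differences is 2 and $\rho = 1 - 12/120 = 0.9$.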
This experiment presents a supportive example that the SimEx-predicted similarity relationship would be reasonably consistent with that implied by conventional transfer learning practices. If it is indeed consistent, then what would be the advantage of SimEx? It is the runtime efficiency, as predicting the similarity relationship by using SimEx does not require any training at comparison time. We present the latency experiment results in the following subsection.
Figure 4: Inter-dataset similarity. Darker means: (a), (b) smaller losses; (c) higher accuracy.
5.2 Latency of predicting inter-dataset similarity
In many real applications, estimating the similarity between data could be a practical issue in terms of computational complexity and the latency involved therein. For example, popular service platforms must deal with a massive influx of new data; a mid-2013 article reports that more than 350 million photos were being uploaded to Facebook every day¹, and the CEO of YouTube revealed that 400 hours of content was being uploaded to YouTube every minute in mid-2015². Upon arrival of arbitrary data, predicting the characteristics of the new data with respect to the reference data or models that various service APIs rely on could be an early step that occurs frequently. Furthermore, interactive or real-time services, such as product identification by mobile computer vision³, are highly latency-sensitive.
In this experiment, we present the latency measurements in the context of the previous experiments presented in Section 5.1, i.e., predicting the inter-dataset similarity by SimEx and the baseline methods based on transfer learning. In both approaches, there are two types of latency: (1) the one-time latency to build the pretrained models for each reference dataset, i.e., scratch-training the autoencoders (SimEx) and the classifiers (baselines), and (2) the runtime latency to compare an arbitrary dataset against the reference datasets, i.e., inferring with the autoencoders (SimEx) and transferring from the pretrained classifiers (baseline). As the latter takes place whenever new incoming data is to be compared against the reference datasets, we conjecture that the runtime latency would matter much more in many services with large-scale or interactive requirements. Below we discuss the latency results for the latter, followed by the results for the former.
¹ https://www.businessinsider.com/facebook-350-million-photos-each-day-2013-9
² https://www.tubefilter.com/2015/07/26/youtube-400-hours-content-every-minute/
³ https://www.amazon.com/b?ie=UTF8&node=17387598011
Figure 5 depicts the mean runtime latency values averaged from 5 measurements per configuration. The error bars are omitted for brevity as the training times are highly consistent and thereby the standard deviations are insignificantly small, e.g., around 2% of the mean. The configurations include SimEx and 5 different baseline configurations denoted by TL-1 through TL-5 with varying optimizers and learning rates. (TL stands for ‘transfer learning’.) The latency values reported here are for pair-wise comparison, i.e., comparing the unknown dataset against one reference dataset. For SimEx, the latency indicates the total time to have all samples in the new dataset forward-pass an autoencoder pretrained with respect to a reference dataset. For baselines, the latencies indicate the time elapsed until the transfer learning hits the minimum loss. All measurements were done on the identical hardware and framework setup.
SimEx takes 2.103 seconds per inference, which is more than 10 times faster than the best-performing baseline, the TL-1 configuration, which completes within 22.07 seconds / 8 epochs with the RMSprop optimizer and its default learning rate to transfer the reference classifier to the unknown dataset. The other baseline configurations TL-2 through TL-4, which use different optimizers and their default learning rates, exhibit worse transfer latencies, i.e., 32.24 – 40.20 seconds at 12 – 19 epochs. In fact, it would be an aggressive strategy to use the default learning rates for transfer learning; it is a common practice to fine-tune the transferring model with a smaller learning rate. TL-5 reflects such a case in that the learning rate is reduced to 10% of the default value used in TL-1. Not surprisingly, the TL-5 latency is roughly 10 times longer than that of TL-1. Further increasing the learning rates beyond the default values might reduce the runtime latencies of the baselines, but we observed degraded classification accuracy in exchange.
Figure 5: Latencies for pair-wise similarity prediction.
Now that we verified SimEx’s runtime latency advantage outperforming the baselines by more than an order of magnitude, we examine the one-time latency to build the pretrained models for each reference dataset.
We found a tricky issue with training the SimEx autoencoders: the loss keeps decreasing slowly for a very long time, taking 5+ hours / 4000+ epochs to reach the minimum loss. In contrast, for the baselines, scratch-training the base classifiers converges to their minimum losses much more quickly, e.g., taking 23.49 – 33.45 seconds / 7 – 14 epochs. Even though training for the reference datasets takes place only once per reference dataset and could be done offline before services are active, SimEx’s one-time latency overheads seem too large to be tolerated.
To circumvent, we note that SimEx does not have to train its autoencoders all the way to their best, because SimEx does not pursue high-quality reconstruction; what matters is only the relative difference between reconstructed results. Therefore, we can safely stop the training earlier at the point where the relative similarity orderings converge, and it may be reachable not necessarily with best-trained autoencoders yielding high-quality reconstruction.
We examined this strategy. In Figure 6a, the red solid line depicts the results for SimEx trained with the RMSprop optimizer at varying training epochs of 3, 5, 7, 10, 25, 50, and 100. The x-axis represents the time elapsed until each epoch, and the error bars represent the min and max values from 5 repeated measurements at each point. The vertical lines labeled BT-1 through BT-4 represent the one-time latencies for the baseline methods to train their base classifiers (BT stands for ‘base training’).
At each training epoch, we compared the similarity orderings inferred by the interim autoencoder model at that epoch against the similarity orderings from the baseline method, which is represented by Spearman’s correlation coefficients ($\rho$) in Figure 6a. We found that, at 203.6 sec / 50 epochs, the interim autoencoder’s $\rho$ reaches the final model’s $\rho$ with zero standard deviation across 5 repeated measurements, and remains the same for further epochs. If we tolerate a slight difference, the interim autoencoder achieves a stable, non-fluctuating $\rho$ at 42.95 sec / 10 epochs, which is slightly less than twice the training time of the fastest-training baseline classifier.
Figure 6: Interim training time vs. Spearman’s correlation
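The early-stopping idea can be sketched as a small helper that scans checkpointed correlation values and returns the earliest epoch from which the interim correlation stays within a tolerance of the final one. The function name `first_stable_epoch` is hypothetical, and the correlation values below are made up for illustration; they are not the measurements behind Figure 6.

```python
from typing import List, Optional, Tuple


def first_stable_epoch(
    rho_by_epoch: List[Tuple[int, float]], tolerance: float = 0.0
) -> Optional[int]:
    """Earliest checkpoint from which rho stays within `tolerance` of the final rho.

    `rho_by_epoch` holds (epoch, rho) pairs sorted by epoch; the last entry is
    treated as the final model.
    """
    final_rho = rho_by_epoch[-1][1]
    stable_since: Optional[int] = None
    for epoch, rho in rho_by_epoch:
        if abs(rho - final_rho) <= tolerance:
            if stable_since is None:
                stable_since = epoch  # start of a candidate stable run
        else:
            stable_since = None  # a later deviation resets the run
    return stable_since


# Hypothetical correlations at the checkpoints 3, 5, 7, 10, 25, 50, and 100.
checkpoints = [(3, 0.5), (5, 0.7), (7, 0.6), (10, 0.8),
               (25, 0.8), (50, 0.9), (100, 0.9)]

print(first_stable_epoch(checkpoints))                  # exact match with final
print(first_stable_epoch(checkpoints, tolerance=0.15))  # tolerating a small gap
```

Loosening the tolerance moves the stopping point earlier, which mirrors the trade-off between the 50-epoch and 10-epoch operating points discussed above.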
Note that we used a reduced learning rate in Figure 6a, below the default value of RMSprop. We have experimented with various combinations of optimizers and learning rates, among which the configuration in Figure 6a exhibits the quickest and highest convergence towards the final $\rho$. Counterintuitively, we found a general trend that a learning rate smaller than the optimizer’s default actually helps the model reach the final $\rho$ more quickly. For comparison, Figure 6b illustrates the configuration with RMSprop and its default learning rate, along with another configuration out of the many we have experimented with. We leave further investigation of this counterintuitive trend to future work.
The results presented in this experiment were collected under iSSIM loss across all measurements. We observed very similar trends under MSE loss, thereby omitting the results.
In this section, we investigate the ability of SimEx to predict inter-class similarities (Section 3) for supplementing a dataset that has a small number of samples with samples from datasets that have a large number of samples.
To explore the potential of SimEx with regard to such questions, we conducted a set of pilot experiments. We consider M a new dataset, and {E, R, B, F} the reference datasets. We train a 10-class MNIST classifier based on LeNet-5, but with as few as 10 samples per class from MNIST itself. We supplement each class of the training set with a varying number of samples borrowed from a third-party class that belongs to one of the reference datasets. Depending on the relevance of this third-party class to the MNIST class it is supplementing, it would be harmful or beneficial to the accuracy of our MNIST classifier. Our hypothesis is that SimEx-predicted similarity would help make informed pairings between third-party classes and MNIST classes in favorable ways. Formally, let $M_i$ denote the i-th class of M. We supplement $M_i$ with the samples from the k-th class $Y_k$ of Y, where $Y \in \{E, R, B, F\}$. We varied the number of original MNIST samples $|M_i|$ from 10 to 100. For each $|M_i|$, we varied the number of supplementing samples $|Y_k|$ from 10 to 1000.
To facilitate informed pairings between $M_i$ and $Y_k$, we apply SimEx and two baseline methods for comparison, namely sample-MSE and embeddings. In SimEx, we pretrained per-class autoencoders $A(Y_k)$ for all $Y \in$ {E, R, B, F} and all classes $k$ thereof. Then the MNIST samples of $M_i$ are tested by all $A(Y_k)$, forming a matrix of mean reconstruction losses $\Delta_{Y_k}(M_i)$ for all i and k for each Y. The (i, k) pair with the least $\Delta_{Y_k}(M_i)$ decides the first pairing of $M_i$ and $Y_k$. The next pairs are iteratively decided in increasing order of $\Delta_{Y_k}(M_i)$ in such a way that each i and k is chosen only once. In sample-MSE, we estimate the similarity between all combinations of $M_i$ and $Y_k$ based on their mean Euclidean distances directly in the sample space, followed by the same pairing step. sample-MSE is the baseline for comparing SimEx-predicted similarity with that derived from the sample space distances. In the other baseline, embeddings, the samples of $M_i$ and $Y_k$ are input to the per-dataset autoencoder A(Y). Then we compute the Euclidean distances between the ‘embeddings’ of $M_i$ and $Y_k$ that appear at the bottleneck of A(Y), followed by the same pairing step. We include embeddings as a baseline in order to compare SimEx-predicted similarity with that derived from the embedding space distances. It is reported that embedding space distances often help distill the semantic relevance (Frome et al., 2013; Mikolov et al., 2013), although this is controversial in that a neural network model may be fooled into finding short embedding-space distances between visually irrelevant noisy inputs (Nguyen et al., 2015).
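The pairing step shared by all three methods is a greedy one-to-one assignment over a score matrix, which can be sketched as follows. The function name `greedy_pairing` and the toy matrix are hypothetical; `score[i][k]` stands for whichever dissimilarity the method provides (mean reconstruction loss, sample-space distance, or embedding distance) between target class i and reference class k.

```python
from typing import Dict, List


def greedy_pairing(score: List[List[float]]) -> Dict[int, int]:
    """Pair each target class i with one reference class k, smallest scores first.

    Every i and every k is used at most once; with more reference classes than
    target classes, the surplus reference classes stay unpaired.
    """
    candidates = sorted(
        (score[i][k], i, k)
        for i in range(len(score))
        for k in range(len(score[i]))
    )
    pairs: Dict[int, int] = {}
    used_k = set()
    for _, i, k in candidates:
        if i in pairs or k in used_k:
            continue  # this row or column is already taken
        pairs[i] = k
        used_k.add(k)
    return pairs


# Toy 3x4 matrix: 3 target classes, 4 reference classes (values are made up).
toy_scores = [
    [0.20, 0.90, 0.50, 0.70],
    [0.80, 0.10, 0.70, 0.60],
    [0.30, 0.60, 0.40, 0.35],
]
pairs = greedy_pairing(toy_scores)
print(pairs)  # {1: 1, 0: 0, 2: 3}
```

In the toy matrix, class 2 would prefer reference class 0 (score 0.30), but that column is already taken by class 0, so it falls back to its next-best free column. This greedy rule is what the text describes; it is not guaranteed to minimize the total score the way an optimal assignment would.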
Figure 7 enumerates the results of the combinations of {reference dataset $Y$, method}. The first, second, and third rows represent the SimEx, sample-MSE, and embeddings methods, respectively. The x-axes represent $|Y_k|$, the number of supplementary training samples borrowed from the class $Y_k$ of the reference dataset Y, and the y-axes represent the 10-class MNIST classification accuracy at testing. The separate lines in each chart distinguish $|M_i|$, the number of original MNIST samples. The error bars represent the min and max accuracy out of 10 trials per combination. For testing, we used 1000 MNIST samples per class. Note that the SimEx results in the first row were obtained with MSE loss. Our SimEx experiments with iSSIM loss exhibited little difference from those with MSE loss here, so we omit the charts due to the page limit.
Several trends are visible. Not surprisingly, more MNIST samples available at training yield higher accuracy at testing. Borrowing more samples from third-party classes yield either (roughly) monotonically increasing or decreasing accuracy. For Y = B, it appears that SimEx made the more beneficial pairings compared to the two baselines under the same condition; adding more samples from B as informed by SimEx results in a larger boost of test accuracy. To reason, SimEx led to the pairings such that the M classes 0 through 9 correspond to the B classes 0 through 9 in the same order. However, sample-MSE determined that is closer to
than
and vice versa, resulting in flipped pairings between classes 1 and 8. The embeddings method determined even more garbled pairings, resulting in further degraded accuracy. For Y = R, all three methods determined identical pairings, resulting in similarly increasing accuracy. For Y = E, the pairings determined by SimEx are neither convincingly beneficial nor harmful. However, the baselines exhibit slightly declining accuracy, implying that their pairings are less favorable. Detailed class-by-class pairings for EMNIST are listed in the supplementary materials.
These experiments imply that SimEx could outperform the baseline methods in predicting the class-wise similarity between a new dataset and a given reference dataset, such that augmenting the new dataset’s classes according to the determined pairings is more beneficial, or less harmful, to the original classification task than the alternative pairings. A remaining question is how SimEx could inform the choice of the right reference dataset out of many.
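The augmentation step evaluated above can be sketched as follows: for each target class, a fixed number of samples is borrowed from its paired reference class and relabeled with the target class before training. The function name `augment_with_paired_classes`, the dictionary form of the pairing, and the random sampling without replacement are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np

def augment_with_paired_classes(x, y, ref_x, ref_y, pairing, n_borrow, seed=0):
    """Append n_borrow reference samples per target class, relabeled as that class.

    pairing maps a target class label to its paired reference class label.
    """
    rng = np.random.default_rng(seed)
    xs, ys = [x], [y]
    for target_class, ref_class in pairing.items():
        idx = np.flatnonzero(ref_y == ref_class)
        take = rng.choice(idx, size=min(n_borrow, len(idx)), replace=False)
        xs.append(ref_x[take])
        # Borrowed samples take the label of the target class they supplement.
        ys.append(np.full(len(take), target_class, dtype=y.dtype))
    return np.concatenate(xs), np.concatenate(ys)

# Toy data (assumption): 2 target classes, 2 reference classes, 2 features.
x = np.zeros((4, 2))
y = np.array([0, 0, 1, 1])
ref_y = np.array([0, 0, 0, 1, 1, 1])
ref_x = np.repeat(ref_y[:, None] + 10.0, 2, axis=1)   # class c has value 10 + c
aug_x, aug_y = augment_with_paired_classes(
    x, y, ref_x, ref_y, pairing={0: 1, 1: 0}, n_borrow=2)
```

Here the pairing is deliberately crossed (target class 0 borrows from reference class 1) to make the relabeling visible in the output.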
Figure 7: Classification accuracy after supplementing training samples from a heterogeneous dataset.
Note that, for two reference datasets V, W and an unknown dataset and
resulting from A(V) and A(W), respectively, may not be directly comparable to each other. Especially when there is a large difference in sample complexity between V and W, it may result in largely different dataset-wide error offsets at testing. R and B would be such a case. As a preliminary attempt at regularization, we normalized the
values shown in Figure 4a by the dataset-wide mean L2-norm of all samples for each
. For example,
,
, ... ,
are normalized by the mean L2-norm of all the samples
), and
, ... ,
are by the mean L2-norm of
), and so on. Interestingly, the post-normalization order among all
where X = M was:
, which happens to be identical to the reverse of the order of positive accuracy boosts by SimEx exhibited in Figure 7, i.e., B > R > E > F. This is by no means conclusive, but it is an interesting observation worth investigating further.
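The normalization described above amounts to dividing each raw reconstruction error by a dataset-wide scale. The sketch below is a minimal version under two assumptions: each sample is flattened into a vector before its L2-norm is taken, and the norm is computed over the reference dataset whose autoencoders produced the errors. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def mean_l2_norm(samples):
    """Dataset-wide mean L2-norm; each sample is flattened to a vector first."""
    flat = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    return float(np.linalg.norm(flat, axis=1).mean())

def normalize_errors(errors, reference_samples):
    """Divide raw reconstruction errors by the mean L2-norm of the reference samples."""
    return np.asarray(errors, dtype=float) / mean_l2_norm(reference_samples)

# Toy example (assumption): two samples with L2-norms 5 and 10, so the mean is 7.5.
reference_samples = np.array([[3.0, 4.0], [6.0, 8.0]])
raw_errors = [15.0, 7.5]
normalized = normalize_errors(raw_errors, reference_samples)
```

Dividing by a per-dataset scale makes errors from autoencoders trained on datasets of very different intensity ranges easier to compare, although it does not remove differences in sample complexity.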
5.4 Predicting inter-class similarity within a single dataset
In this experiment, we explore the usability of SimEx-predicted similarity for early screening of inter-class confusion in a given dataset. We apply SimEx and sample-space distances to predict inter-class similarity (Section 3). The SimEx methods are analogous to those in Section 4: 10 autoencoders are trained, one per MNIST class, denoted A(0), A(1), ..., A(9). We train two sets of such autoencoders, one with MSE (denoted SimEx-MSE) and the other with iSSIM (denoted SimEx-iSSIM). The sample-space distances are the baselines. For each pair of classes, mean sample distances are measured by two metrics, denoted sample-MSE and sample-iSSIM, respectively.
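To make the sample-MSE baseline concrete, the following sketch builds a class-by-class matrix of mean sample distances. It assumes that "mean sample distance" means the MSE averaged over all cross pairs of samples from the two classes; the function names and the toy images are illustrative, and the iSSIM variant is not shown.

```python
import numpy as np

def mean_pairwise_mse(a, b):
    """Mean of per-pixel MSE over all cross pairs (a_i, b_j)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 a_i . b_j
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    sq = np.maximum(sq, 0.0)          # guard against tiny negative round-off
    return float(sq.mean() / a.shape[1])

def inter_class_distance_matrix(samples_by_class):
    """Entry (i, j) is the mean pairwise MSE between classes i and j.

    Diagonal entries include each sample paired with itself.
    """
    k = len(samples_by_class)
    d = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            d[i, j] = mean_pairwise_mse(samples_by_class[i], samples_by_class[j])
    return d

# Toy classes (assumption): all-black and all-white 2x2 images.
samples_by_class = [np.zeros((3, 2, 2)), np.ones((3, 2, 2))]
d = inter_class_distance_matrix(samples_by_class)
```

A smaller entry indicates a pair of classes whose samples are closer in pixel space, which is the sense of "similar" used by this baseline.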
Figure 8: Inter-class similarity / confusion in MNIST dataset. Darker means more similar / confusing.
We compare the inter-class similarity predicted by these methods against the confusion levels observed from MNIST classifiers. To observe the confusion levels, we first attempted to use the distribution of an MNIST classifier’s outputs. However, training an MNIST classifier converges very quickly; even after a few epochs, testing samples from the k-th class produce insignificantly small values at non-k class output neurons. As a reliable alternative, we trained ten 9-class classifiers, denoted for
. The training set of each
is missing
, the samples of class k. At testing,
yields nontrivial confusion levels from
, which the other methods’ similarity predictions are compared against. To mitigate possible model biases, we use two sets of 9-class classifiers from LeNet-5 and ResNet-18, respectively. These methods are denoted 9class-Le5 and 9class-Res18, respectively.
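One way to read confusion levels off a 9-class classifier is sketched below: the samples of the held-out class are fed to the classifier trained without that class, and its softmax outputs are averaged per remaining class. Averaging the softmax outputs is an assumption made for illustration, as are the function name `confusion_row` and the toy probabilities; the text does not specify the exact aggregation.

```python
import numpy as np

def confusion_row(probs, held_out, num_classes=10):
    """Confusion levels of the held-out class against every other class.

    probs: (n, num_classes - 1) softmax outputs on samples of `held_out`,
    produced by the classifier trained without `held_out`; columns follow
    the remaining class labels in ascending order.
    """
    row = np.full(num_classes, np.nan)          # the held-out entry stays NaN
    remaining = [c for c in range(num_classes) if c != held_out]
    row[remaining] = np.asarray(probs, dtype=float).mean(axis=0)
    return row

# Toy example (assumption): 3 classes in total, class 1 held out, 2 test samples.
probs = np.array([[0.8, 0.2],
                  [0.6, 0.4]])
row = confusion_row(probs, held_out=1, num_classes=3)
```

Stacking one such row per held-out class gives a full confusion matrix that can be compared against the similarity matrices of the other methods.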
Table 2: Mean of Spearman’s ρ between inter-class similarity/confusion results by different methods
Figure 8 depicts the similarity / confusion results. We found that the results from SimEx-iSSIM and sample-iSSIM exhibit trends very similar to those from SimEx-MSE and sample-MSE, so we omit them. The different contrasts of the colormaps are not necessarily an issue, as we are interested in their relative orders. We assessed the consistency of similarity / confusion orderings in the same manner as in Section 5.1. Table 2 (left) lists the mean values of Spearman’s ρ between SimEx-s and 9class-s, as well as between SimEx-s and the baseline sample-s. Given
, the similarity predicted by both SimEx- methods is reasonably consistent with the confusion observed from both 9class-s, and to an equivalent extent with the similarity predicted by the baseline sample-s. These results might suggest that SimEx- methods are no better than sample-s. However, Table 2 (right) lists the ρ values between sample-s and 9class-s; SimEx noticeably outperforms the baselines.
These results indicate that the similarity orderings from SimEx lie at a middle point between those from sample-s and 9class-s. The main implication is that SimEx may be more beneficial than plain sample-space distances (even when the sample-space distance takes structural similarity into account) for predicting the potential confusion levels of an unknown class with respect to the known classes of a dataset, although more experiments are necessary to warrant a strong claim.
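The consistency measure used in this comparison can be sketched as a mean of rank correlations. The code below assumes that the orderings are compared row by row (for each class, the ranking of the other classes under two methods) and that the scores contain no ties; both are assumptions for illustration, since the exact procedure is the one defined in Section 5.1. The function names and toy matrices are hypothetical.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rank correlation for inputs without ties (assumption)."""
    rx = np.argsort(np.argsort(x)).astype(float)   # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def mean_rowwise_rho(sim_a, sim_b):
    """Mean rho over classes; each row ranks the other classes, diagonal excluded."""
    k = sim_a.shape[0]
    rhos = []
    for i in range(k):
        others = [j for j in range(k) if j != i]
        rhos.append(spearman_rho(sim_a[i, others], sim_b[i, others]))
    return float(np.mean(rhos))

# Toy matrices (assumption): a monotone transform keeps every ordering,
# while negation reverses every ordering.
sim_a = np.arange(16, dtype=float).reshape(4, 4)
rho_same = mean_rowwise_rho(sim_a, np.exp(sim_a))
rho_reversed = mean_rowwise_rho(sim_a, -sim_a)
```

Because only ranks enter the computation, differences in scale between methods, such as the differing colormap contrasts noted above, do not affect the result.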
In this paper, we proposed SimEx and experimentally explored the usability and benefits of SimEx-predicted similarity in the context of typical transfer learning and data augmentation. We demonstrated that SimEx-predicted data similarity is highly correlated with a number of performance differences that mainly result from dataset selection, e.g., in pretrained network transfer, training set augmentation, and inter-class confusion problems. We also showed that SimEx’s data similarity prediction yields results equivalent to or better than the baseline data comparisons in sample or embedding space. Importantly, we demonstrated that SimEx achieves more than a 10-fold speedup in predicting inter-dataset similarity compared to the conventional transfer learning-based methods, as comparisons in SimEx require only inference and no training.
Our pilot results show the early potential of this newly developed method to be usable and advantageous in several exemplary machine learning exercises, especially those with a tight latency bound or a scalability requirement. We believe that further theoretical and empirical studies will lead to a firmer and deeper understanding of the method.
References
Bengio, Y. Deep learning of representations for unsupervised and transfer learning. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning workshop-Volume 27, pp. 17–37. JMLR.org, 2011.
Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, abs/1003.0358, 2010. URL http://arxiv.org/abs/1003.0358.
Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. EMNIST: an extension of MNIST to handwritten letters. arXiv preprint arXiv:1702.05373, 2017.
Dai Wenyuan, Y. Q., Guirong, X., et al. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning, Corvallis, USA, pp. 193–200, 2007.
Delahunt, C. B., Mehanian, C., and Kutz, J. N. Money on the table: Statistical information ignored by softmax can improve classifier accuracy. arXiv preprint arXiv:1901.09283, 2019.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al. DeViSE: A deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129, 2013.
Fuglede, B. and Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory, 2004. ISIT 2004. Proceedings., pp. 31. IEEE, 2004.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., and Lempitsky, V. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
Ge, W. and Yu, Y. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1086–1095, 2017.
Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. IEEE, 2012.
Gopalan, R., Li, R., and Chellappa, R. Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE transactions on pattern analysis and machine intelligence, 36(11):2288–2302, 2013.
Kullback, S. and Leibler, R. A. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th international conference on Machine learning, pp. 473–480. ACM, 2007.
Larsen, A. B. L., Sønderby, S. K., Larochelle, H., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pp. 1558–1566, 2016.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. arXiv preprint arXiv:1502.02791, 2015.
Mansimov, E., Parisotto, E., Ba, J. L., and Salakhutdinov, R. Generating images from captions with attention. arXiv preprint arXiv:1511.02793, 2015.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436, 2015.
Ni, J., Qiu, Q., and Chellappa, R. Subspace interpolation via dictionary learning for unsupervised domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 692–699, 2013.
Oliva, A. and Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. International journal of computer vision, 42(3):145–175, 2001.
Perez, L. and Wang, J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621, 2017.
Seltzer, M. L., Yu, D., and Wang, Y. An investigation of deep neural networks for noise robust speech recognition. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 7398–7402. IEEE, 2013.
Snell, J., Ridgeway, K., Liao, R., Roads, B. D., Mozer, M. C., and Zemel, R. S. Learning to generate images with perceptual similarity metrics. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 4277–4281. IEEE, 2017.
Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176, 2017.
Wang, Z., Bovik, A. C., Sheikh, H. R., Simoncelli, E. P., et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Yan, X., Yang, J., Sohn, K., and Lee, H. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pp. 776–791. Springer, 2016.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328, 2014.