In recent years, data volume has grown exponentially, and this has created demands for building statistical methods to analyze huge datasets. In particular, the size of these datasets far exceeds the available computational capability at hand. For instance, it may not be computationally feasible to perform standard statistical procedures on a single machine when the datasets are huge. Although one remedy is to develop sophisticated distributed computing systems that can directly handle large datasets, the increased system complexity makes this approach not suitable for all scenarios. Another remedy to this problem is to employ a subsampled dataset that can be handled by existing computational resources. In fact, such an approach is widely used to solve big data problems. However, subsampling may suffer from loss of statistical accuracy, i.e., the variance of the resulting estimator may be large. Therefore a natural question is to tradeoff statistical accuracy for computational efficiency by designing an effective sampling scheme that can minimize the reduction of statistical accuracy given a certain computational capacity.
In this paper, we examine the subsampling approach for solving large-scale multi-class logistic regression problems that are common in practical applications. The general idea of subsampling is to assign an acceptance probability for each data point and select observations according to the assigned probabilities. After the subsampling procedure, only a small portion of the data are selected from the full dataset. Hence, the model built using the subsampled data will not be as accurate as that of the full data. The key challenge is to design a good sampling scheme together with the corresponding estimation procedure such that the loss of statistical accuracy is minimized, given some fixed computational resource. Here, the required computational resource can be measured by the number of subsampled data.
There has been substantial work on subsampling methods for large-scale statistical estimation problems [1, 2, 3, 4, 5, 6, 7]. The simplest method is to subsample the data uniformly. However, uniform subsampling assigns the same acceptance probability to every data point, which fails to differentiate the importance among the samples. For example, a particular scenario often encountered in practical applications of logistic regression is when the class labels are imbalanced. This problem has attracted significant interests in the machine learning literature (see survey papers in [8, 9]). Generally, there are two types of commonly encountered class imbalance situations: marginal imbalance and conditional imbalance. In the case of marginal imbalance, some classes are much rarer than the other classes. This situation often occurs in applications such as fraud and intrusion detection [10, 11], disease diagnoses [12], protein fold classification [13], and etc. On the other hand, conditional imbalance describes the case when the labels (denoted as y) for most observations (denoted as x) are easy to predict. This happens in applications with very accurate classifiers such as handwriting digits recognition [14] and web/email spam filtering [15, 16]. Note that marginally imbalance implies conditional imbalance, while the reverse is not necessarily true.
For marginally imbalanced binary classification problems, case-control subsampling (CC), which uniformly selects an equal number of samples from each class, has been widely used in practice in epidemiology and social science studies [17]. Under this scheme, equal number of samples are subsampled from each class, and therefore the sampled data are marginally balanced. It is known that case-control subsampling is more efficient than uniform subsampling when the datasets are marginally imbalanced. However, since the acceptance probability relies on the response variable in CC subsampling, the distribution of subsampled data is skewed by the sample selection process [18]. It follows that correction methods are needed to adjust for the selection bias [19, 20]. Another method to remove bias in CC subsampling is to weight each sampled data point by the inverse of its acceptance probability. This is known as the weighted case-control method, which has been shown to be consistent and unbiased [21], but may increase the variance of the resulting estimator [22, 23, 24].
One drawback of the standard case-control subsampling is that it does not consider the case when the data are conditionally imbalanced. This issue was addressed in [15], who proposed an improved subsampling scheme called Local Case-Control (LCC) sampling for binary logistic regression. The LCC method assigns each data point an acceptance probability determined not only by its label but also by its observed covariates. LCC assigns more importance on data points that are easy to be mis-classified according to a consistent pilot estimator, which is an approximate conditional probability estimator possibly obtained using a small number of uniformly sampled data. The method in [15] fits a logistic model with the LCC sampled data, and then apply a post-estimation correction to the resulting estimator using the pilot estimator. Therefore, the LCC sampling approach belongs to the correction based methods such as that of [19, 20]. It was shown in [15] that given a consistent pilot estimator, the LCC estimator is consistent with an asymptotic variance that may significantly outperform that of the uniform sampling and CC based sampling methods when the data is strongly conditionally imbalanced.
In this paper, we propose an effective sampling strategy for large-scale multi-class logistic regression problems that generalizes LCC sampling. Our proposed sampling procedure can be summarized in the following:
(1) it assigns an acceptance probability for each data point and selects observations according to the assigned probabilities;
(2) it fits a multi-class logistic model with the subsampled observations to obtain an estimate of the unknown model parameter.
In the proposed framework, the acceptance probability for each data point can be obtained using an arbitrary probability function. Unlike correction based methods [20, 15] that are specialized for certain models such as linear model, we propose a maximum likelihood estimate (MLE) that integrates the correction into the MLE formulation, and this approach allows us to deal with arbitrary sampling probability and produces a consistent estimator within the original model family as long as the underlying logistic model is correctly specified. This new integrated estimation method avoids the post-estimation correction step used in the existing literature.
Based on this estimation framework, we propose a new sampling scheme that generalizes LCC sampling, which we describe briefly in the following. Given a rough but consistent prediction of the true probability distribution p(y|x), this scheme preferentially chooses data points with labels that are conditionally uncertain given their local observations x based on
. The proposed sampling strategy is therefore referred to as Local Uncertainty Sampling (LUS). We show that the LUS estimator can achieve an asymptotic variance that is never worse than that of the uniform random sampling. That is, we can achieve variance of no more than
times the variance of the full-sample based MLE by using no more than
of the sampled data in expectation. Moreover the required sample size can be significantly smaller than
of the full data when the accuracy of the rough estimate
is high. This generalizes the result for LCC in [15], which reaches a similar conclusion for binary logistic regression when
We also study the case when the model is misspecified. In this case, for binary classification, LUS has the same properties as those of LCC that the subsampling based estimator is consistent to the best estimator for the original population given a consistent pilot estimate. Unfortunately, for general multi-class problems, the LUS estimator is biased. Nevertheless, we empirically find that the LUS method works well for both binary classification and multi-class classification problems even when the model is misspecified.
We conduct extensive empirical evaluations on both simulated and real-world datasets, showing that the experimental results match the theoretical conclusions and the LUS method significantly outperforms the previous approaches.
Our main contributions can be summarized as follows:
• we propose a general estimation framework for large-scale multi-class logistic regression, which can be used with arbitrary sampling probabilities. The procedure always generates a consistent estimator within the original model family when the model is correctly specified. This method can be applied to general logistic models without the need of post-estimation corrections;
• under this framework, we propose an efficient sampling scheme which we refer to as local uncertainty sampling. For any , the method can achieve asymptotic variance no more than that of the uniform subsampling with sampling probability
, using an expected sample size no more than that of the uniform subsampling. Moreover, the required sample size can be significantly smaller than that of the uniform subsampling when the classification accuracy of the underlying problem is high.
For a K-class classification problem, we observe data points from an unknown underlying distribution D, where x is the feature vector and y is the corresponding label. Given a set of n independently drawn observations
from D, we want to estimate K conditional probabilities
for k = 1, 2, . . . , K. This paper considers
multi-class logistic model with the following parametric form:
where is the model parameter for the k-th class and f is a known function. The above model implies that
Let be the parameter for the entire model. Eq. (2.1) is specified in terms of
log-odds with the constraint that the probabilities of each class should sum to one. Note that the logistic model uses one reference class as the denominator in the odds-ratios, and the choice of the denominator is arbitrary since the estimates are equivalent under this choice. We use the K-th class as the reference class throughout the paper.
Moreover, is the maximizer of the expected population likelihood:
where is an indicator function. In the maximum likelihood formulation of multi-class logistic regression, the unknown parameter
is estimated from the data by maximizing the empirical likelihood:
For large-scale multi-class logistic regression problems, n can be extremely large. In such cases, solving the multi-class logistic regression problem (2.4) may be computationally infeasible due to limited computational resources. To overcome this computational challenge, we propose a subsampling framework in the following section.
In this section, we introduce the estimation framework with subsampling for large-scale multi-class logistic regression. There are two main steps:
(1) for every data point (x, y), suppose that the arbitrary sampling probability function [0, 1] is given. For each pair of observation
), generate a random binary variable
, drawn from the {0, 1}-valued Bernoulli distribution
with acceptance probability
(2) keep the samples with for
. Fit a multi-class logistic regression model based on the selected samples by solving the optimization problem
We now derive the above procedure under the assumption that the logistic model is correctly specified as in Eq. (2.2). As we will show later, the acceptance probability used in the first step can be an arbitrary function, and the above method always produces a consistent estimator for the original population. The computational complexity in the second step is reduced to fitting the model with samples after the subsampling step.
Given , we draw a random variable z according to the Bernoulli distribution B(x, y). This gives the following augmented distribution A for the joint variable
{0, 1} with probability
Note that each sampled data pair follows , and the random variable
is independently drawn from
. It follows that each joint data point
is drawn i.i.d. from the distribution A. For the sampled data
, the distribution of random variable (x, y) follows from
If f is correctly specified for D, then the following function family
is correctly specified for A, i.e., the true parameter in Eq. (2.2) also satisfies
Therefore, we have the following logistic model under A:
It follows that given arbitrary sampling probability function can be obtained by using MLE with respect to the new population A:
In practice, the model parameter can be estimated by empirical conditional MLE with respect to the sampled data
which is equivalent to Eq. (3.1). Let be the subsampling based estimator. As we will see in the next section,
is a consistent estimator of
when the model is correctly specified.
In this section, we examine the asymptotic behavior of the proposed method described in Section 3. All of the proofs are provided in Appendix A.
4.1 Consistency and Asymptotic Distribution
First, based on the empirical likelihood in Eq. (3.4), we have the following result for
Theorem 4.1 (Consistency and Asymptotic Normality). Suppose that the parameter space is compact with for all
and
. Moreover, assume the quantities
and
for
are bounded under some norm
for any
Let
. If Eq. (2.2) is satisfied, i.e., the model is correctly specified, then given an arbitrary sampling probability function a(x, y), as
, the following holds:
where is a block diagonal matrix, each block of which is
with notations indicating a(x, k) and p(x, k), respectively.
Theorem 4.1 shows that given an arbitrary sampling probability a(x, k), the proposed method in Section 3 generates a consistent estimator post-estimation correction. This is different from existing methods such as the LCC in [15], which employs post-estimation corrections. One benefit of the proposed method is that without post-estimation correction, we can still produce a consistent estimator in the original model family, and our framework allows different sampling functions for different data points
. For example, in time series analysis, we may want to sample recent data points more aggressively than past data points. This can be handled naturally by our framework, but not by using the post-estimation correction based approach. Another benefit is that the framework can be naturally applied with regularization, which can be regarded as a restriction on the parameter space for
From Theorem 4.1, the estimator is asymptotically normal with variance
Given a data point (x, y), although the sampling probability a(x, y) can be arbitrary, it is important to select a sampling probability such that the variance of the resulting estimator is as small as possible. In the following, we study a specific choice of a(x, y) that achieves this purpose.
4.2 Sampling Strategy
Recall from Eq. (4.2) the definition of S. Let be the corresponding matrix S when we set a(x, k) = 1 for all k, i.e., we accept all data points in the dataset. Then
If we set for all k for some
, i.e., we sample a fraction of
of the full dataset uniformly at random, we denote the corresponding matrix as
. Then,
. In
the following, we denote the asymptotic variances of , the full-sample based estimator and the estimator obtained from
uniformly sampled data as
respectively. Our purpose is to find a better sampling strategy with lower variance than that of uniform sampling. That is, we want to choose an acceptance probability function a(x, k) such that there exists some scalar
The constraint requires the expected subsample size to be no more than , i.e., we sample no more than
fraction of the full data.
Theorem 4.2 (Sampling Strategy). Use the same assumptions and definitions in Theorem 4.1. For any data point x, let
Given any , consider the following choice of the acceptance probability function:
Then, we always have
and the expected number of subsampled examples is
It is easy to check that the assigned acceptance probability in Theorem 4.2 is always valid, i.e., it is a value in [0, 1]. With the sampling strategy in Theorem 4.2, we always use no more than a fraction of the full data to obtain an estimator with variance no more than
times the variance of the full-sample based MLE. In other words, the proposed method is never worse than the uniform sampling method. Moreover, the required sample size
can be significantly smaller than
under favorable conditions.
More precisely, we have the following formula for the expected conditional sampling probability:
Therefore, in favorable case where most for
, i.e., the data are conditionally imbalanced, our method will subsample very few examples to achieve the desired variance compared to that of
random sampling.
An intuitive explanation for this sampling strategy is that if there exists a class k that dominates the other classes for any given , then the sampling probability will be proportional to ‘
’. That is, when the classification accuracy for the problem is high, the proposed sampling method will significantly outperform the random sampling.
For binary classification problem (K = 2), Theorem 4.2 reduces to LCC sampling in [15] when . Although a method to achieve a desired variance for the case of
was also proposed in [15], it is different from our sampling strategy. Moreover, it does not prove that the number of samples needed by LCC for achieving such a desired variance is never worse than that of the uniform sampling for the case of
. In fact, the empirical performance of LCC can be worse than our method under the case of
as we will show in the experimental section.
In the multi-class case, our method is not a natural extension of the LCC sampling, which would imply a method to set all class probabilities to 1/K after sampling. Instead, we will only assign a smaller sampling probability for (x, y) when . The method is less likely to select a sample when the label y coincides with the the prediction of the underlying true model, while it will likely be selected if y contradicts the underlying true model. Since the sampling strategy prefers data points with uncertain labels, we refer it to as Local Uncertainty Sampling (LUS). In the following, we will indicate the estimator as
if the acceptance probability is set according to Eqs. (4.5) and (4.6) (recall that we use the notation
with an arbitrary sampling function a(x, y)).
4.3 Randomness on the Acceptance Probability
In Theorem 4.2, a(x, k) is set as a function of the true probability p(x, k) for k = 1, . . . , K. In practice, we never know p(x, k), because if we do there would not be a need to estimate the model parameters. Instead, we can use a roughly estimated probability to compute the acceptance probability, which we refer to as the pilot estimate.
There are multiple ways to estimate . For example, one can obtain a pilot estimator
by fitting the model using a smaller independent data set, or obtain
from a different and simpler parametric model or non-parametric methods. This is different from the LCC method, which involves post-estimation correction that relies on an explicit pilot estimator
that is additive to the original model parameter. For simplicity, in the following analysis, we still assume that
is computed from a pilot estimator
Recall from Theorem 4.1 that given an acceptance probability, the asymptotic variance for is dependent on
is computed based on
, we see that S is now a function of
and the model parameter
. We rewrite
to emphasize its dependency on
and
. Similarly, we represent
as
. Moreover, with the use of the pilot estimator, from Eq. (3.3),
is now a function of
, and we rewrite it as
In the following, we characterize the asymptotic distribution of when the pilot estimator
is estimated from an independent data set.
be a probability estimate computed using a pilot estimator
such that
where is a constant that is independent of the pilot estimator
Corollary 4.1 implies that as long as the pilot estimator is a consistent estimator of
such that
converges to p(x, k), the randomness induced by the pilot estimator
will not inflate the asymptotic variance of
. The result is in conjunction with that of LCC when the model is correctly specified, and is also a generalization of their results to the case of K > 2.
4.4 Model Misspecification
In practice, the model in Eq. (2.2) may not be correctly specified, so a true parameter may not exist. Under such case, we denote the best estimator obtained by maximizing Eq. (2.3) for the original population D as
and denote the corresponding probability estimate as
for k = 1, . . . , K. Since the model is misspecified, we know that
may not equal to p(x, k).
separately. To distinguish between the two cases, we use lower-case letter to denote the model parameter for K = 2, i.e., binary classification problems. We first have the following results for K = 2.
Proposition 4.1. For K = 2, suppose that the parameter space of is compact that
. Under model misspecification, let
and we have
Moreover, is the unique maximizer of
and the LUS estimator is a consistent estimator of
Proposition 4.1 implies that under model misspecification, if we use a perfect pilot estimate to compute the acceptance probability, then
would also converge to
. Again, in practice, we never know
a priori. Therefore, we will need a pilot estimator
. In the following proposition, we study how randomness of
affects the performance of the LUS estimator when the model is misspecified.
Proposition 4.2. Assume that the pilot estimator is such that
. Set the acceptance probability acceptance probability according to Eqs. (4.5) and (4.6). Under the same conditions in Proposition 4.1, when K = 2 and under model misspefication, we have
Proposition 4.2 implies that as long as the pilot estimator is consistent to ,
is consistent to
. Moreover, under model misspecification, we see that there is an inflation in the asymptotic variance induced by the random pilot estimator
. From Propositions 4.1 and 4.2, when the model is misspecified for K = 2, the LUS method has the same properties as those of the LCC method proposed in [15].
However, when K > 2, the LUS estimator is biased.
Proposition 4.3. Under model misspecification, when K > 2, the LUS estimator is biased even when . However, if there exists a class
where
is the parameter by setting
Proposition 4.3 implies that when , if there exists a majority class such that
then the LUS estimator with respect to the k-th class will be consistent to
when the other parameters
are fixed as
. Explicit characterizations of the bias for the LUS estimator when K > 2 are complex and we put them in Appendix B.
Nevertheless, in our empirical studies, we find that the LUS method works well for both binary classification and multi-class classification problems, even under model misspecification.
In order to apply the sampling strategy in Theorem 4.2 empirically, according to Corollary 4.1, the main idea is to employ a rough but consistent estimate of the probabilities p(x, k) given x, and then assign the acceptance probability according to
. In fact, we only need to have an approximate estimate of q(x) according to Eq. 4.4; that is, we only need to roughly estimate the probability of the majority class given x. Note that
can be obtained via a consistent pilot estimator
(which may be obtained using a small amount of uniformly sampled data) where
. This is similar to what is employed in [15]. However, in practice, one may also use a simpler model family to obtain the pilot estimate, as shown in our MNIST experiment below, or one may use other techniques such as neighborhood based methods [25] for this purpose.
In real-world applications, this rough estimate is often easy to obtain. For example, when data arrives in time sequence, a pilot estimator trained on previous observations can be used for fitting a new model when new observations are arriving. Moreover, a rough estimate obtained on a small subset of the full population can be used for training on the entire dataset. In our experiments, we adopt the latter: a small uniformly subsampled subset of the original population is used to obtain the rough estimate . As we will see later, this choice is sufficient for our method to obtain good practical performance. The LUS algorithm can be described in Algorithm 1.
In this section, we evaluate the performance of the LUS method and compare it with the uniform sampling (US) and case-control (CC) sampling methods on both simulated and real-world datasets. For the CC sampling method, we extend the standard CC sampling considered in the binary classification problem to multi-class case by sampling equal number of data points for each class. Under marginal imbalance, if some minority classes do not have enough samples, we keep all data for those classes and subsample equal number of the remaining data points from other classes. In addition, we also compare the LUS and LCC methods on the Web Spam dataset, which is a binary
classification problem studied in [15]. The experiments are implemented on a single machine with 2.2GHz quad-core Intel Core i7 and 16GB memory.
6.1 Simulation: Marginal Imbalance
We first simulate the case when the data is marginally imbalanced. We generate a 3-class Gaussian model according to , which is the true data distribution D. We set the number of features as
, where
is a
identity matrix. So the true log-odds function f is linear and we use a linear model to fit the simulated data, i.e., the model is correctly specified. Moreover, we set P(Y = 1) = 0.1, P(Y = 2) = 0.8, P(Y = 3) = 0.1, which implies that the data is marginally imbalanced and the second class dominates the population.
the distribution D. For the full dataset, we generate n = 50, 000 data points. The entire procedure is repeated for 200 times to obtain the variance of different estimators. For the LUS method, we randomly generate data points (i.e., an amount of 10% of the full data) from D to obtain a pilot estimator in every repetition. Moreover, we generate another
data points to test the prediction accuracy of different methods.
the following experiments, we will test different values of , respectively. Given the value of
, suppose the LUS method will subsample a number of
data points. Then, we let the US and CC sampling methods select an amount of
examples to achieve fair comparison, because the LUS method has to pay for its usage of a random pilot estimate.
estimator contains a total number of coordinates. Denote
as the coordinate-wise
Figure 1: The value for each coordinate under different values of
denotes the ratio between the variance of each coordinate in the subsampling based estimator and the variance of the coordinate in the full-sample based MLE, i.e.,
Figure 2: Plots in the first simulation.
ratio between the variance of the coordinate in the subsampling based estimator and the variance of the coordinate in the full-sample based MLE. We show the value for each coordinate under different values of
in Fig. 1. In this simulation, there are 42 coordinates. From the figures, we observe that the
value for each coordinate of the LUS method is around
, which matches our theoretical analysis in Theorem 4.2. On the other hand, the variances of the US and CC sampling methods are much higher than that of the LUS method, even when US and CC sample
more data points than the LUS method.
In Fig. 2(a), we plot the relationship between the average for all coordinates against
. From the figure, we observe that the relationship is close to the y = x line (the dashed green line), which implies that
approximately equals
. These experimental results well match our theoretical analysis. Fig. 2(b) reports the relationship between
and
. Fig. 2(c) shows the relationship between the prediction accuracy on the test data and the proportion of used training data
. From the figure, when
decreases, the prediction accuracy of all the methods decreases, while the LUS method shows much slower degradation compared to the US and CC methods. Moreover, according to Fig. 2(c), we only need about 20% of the full data (including those used for computing the pilot estimator) to achieve the same prediction accuracy as the full-sample based MLE, implying that the LUS method is very effective for reducing the computational cost while preserving high accuracy.
6.2 Simulation: Marginal Balance
In this section, we generate marginally balanced data with conditional imbalance. Under this situation, the CC sampling method is identical to US, and hence we omit the CC sampling method from our comparison. The settings are exactly the same as those in the previous simulation, except that we let , which implies that the data is marginally balanced. However, as we will see later, this simulated data is conditionally imbalanced.
Figure 3: The value for each coordinate under different values of
Figure 4: Plots in the second simulation.
The value for each coordinate when
is shown in Fig. 3. The relationship between the average
for all the coordinates and
is plotted in Fig. 4(a). Fig. 4(b) reports the relationship between
and
. In Fig. 4(c), we show the relationship between the prediction accuracy on the test data and the proportion of used training data
. The results are similar to those of the previous simulation and demonstrate the effectiveness of the LUS method under the marginally balanced (but conditionally imbalanced) case. Fig. 4(c) suggests that we only need about 25% of the full data (including those used for computing the pilot estimator) to achieve the same prediction accuracy as that of the full-sample based MLE.
In the simulations, we have fixed the amount of data used for computing the pilot estimate (as ). In general, increasing or decreasing this amount will reduce or increase the variance of the pilot estimate and affect the performance of the LUS estimator accordingly. We provide simulations to show how the performance of the LUS estimator changes with respect to
in Appendix D, and the results suggest that a small
is sufficient to let LUS achieve good performance. Another trick with respect to the pilot estimate is to use two or more phases for the pilot estimate. That is, we can recursively apply an obtained LUS estimator as the pilot estimator for the next round. Since the variance of the LUS estimator is smaller than that of the random sampling based estimator (when they subsample the same amount of data), using multiple phases will reduce the variance of the final estimator compared with using random sampling completely.
Despite of these flexibilities on choosing the pilot estimate, for simplicity, we will use random sampling and fix the portion of data for the pilot estimate in the following experiments.
6.3 MNIST Data
In this section, we evaluate different methods on the MNIST data, which is a benchmark dataset in image classification problems and the state-of-the-art results have achieved less than 1% test error rate on this dataset. Therefore, the classification accuracy of this problem is high. Note that different from the LCC sampling, our LUS method can handle general logistic models. In this experiment, we let the model function f for the LUS estimator to be one of the state-of-the-art deep neural networks. Since we have no knowledge about the underlying true model in this real dataset, the adopted neural network might be misspecified. In order to save computational cost, we use a simpler neural net structure to obtain the pilot estimate
. It is worth mentioning that this simpler neural network is different from the one used for the final LUS estimator, because the LUS method only requires a rough estimate
instead of an explicit estimator which is needed by post-estimation correction based methods. The detailed network structures and parameter settings are provided in Appendix C. For the US method, we apply the same network structure used by the final LUS estimator to achieve fair comparison. Since the MNIST data is marginally balanced, the CC method performs the same with the US method and we omit its comparison here.
The training set consists of 60,000 images and the test set has 10,000 images. We uniformly select data points (i.e., 10% of the training data) to compute the rough estimate
in every repetition and perform 10 repetitions of the experiment to obtain the average performance of different methods. Similar to the simulations, we assume the LUS method samples a number of
data points and let the US method sample
data points. Note that the setting is a bit unfair for LUS because the
data points used for computing
are processed by a simpler neural network. However, we still keep this protocol in the experiment.
We test a number of values of in the range (1, 1.04] and Fig. 5(a) plots the proportion of used data
against
. Fig. 5(c) shows the relationship between the test error (%) and
. Note that the rough estimate has a relatively large error rate of about 3.5%; this is due to the fact that it employs a simpler network structure to save computational cost. Nevertheless, the LUS method can achieve an error rate below 1% using only about 25% of the training data (10% for the pilot estimate and 15% for LUS); with about 45% of the training data (10% for the pilot estimate and 35% for LUS), it achieves the same error rate as that obtained by using the full training data. The LUS method consistently outperforms the US method. Table 1 shows the speedup of the LUS method compared to the full-sample based estimation.
6.4 Web Spam Data: Binary Classification
In this section, we compare the LUS method with the LCC method on the Web Spam datais a binary classification problem used in [15] to evaluate the LCC method. Since the comparison
Figure 5: Plots in MNIST data.
Table 1: Speedup of the LUS method on MNIST data when using 45% of the training set (10% for the pilot estimate and 35% for LUS), i.e., achieving the same error rate with the full-training-sample based MLE.
among the LCC, US and CC methods on this dataset has been reported in [15], we do not repeat them here and focus on the comparison between the LUS and LCC methods. The Web Spam data contains 350,000 web pages and about 60% of them are web spams. This data set is approximately marginally balanced, but it has been shown to have strong conditional imbalance in [15]. We adopt the same settings as described in [15] to compare the LUS and the LCC methods. That is, we use linear logistic model and select 99 features which appear in at least 200 documents, and the features are log-transformed. Note that 10% of the observations are uniformly selected to obtain a pilot estimator as in [15]. Since we only have a single dataset, we follow [15] to uniformly subsample 100 datasets, each of which contains 100,000 data points, as 100 independent ‘full’ datasets, and then repeat the experiments 100 times for comparison.
Observe that when , the LUS and LCC methods are equivalent to each other by setting the parameter
of LCC in Section 3.3 of [15]. Therefore, we only focus on the case of
. In the first experiment, similar to previous experiments, we test different values of
and accordingly set
, so that the two methods have the same asymptotic variance. Then, we will compare the number of subsampled data points to see which method is more effective in terms of subsampled data size
Fig. 6 plots the values for different choices of
. As expected from the theoretical results, both the LUS and LCC methods have the same variance that is approximately
) times variance of the full-sample based MLE. Next, we compare the subsampling proportion of different methods when
changes in Fig. 7. From the figure, the LUS method consistently subsamples fewer data points compared with LCC when they achieve the same variance as shown in Fig. 6.
Alternatively, we fix the proportion of the sampled examples for both LUS and LCC
Figure 6: The value for each coordinate with respect to different values of
in the Web Spam data, where
methods (by carefully choosing ), and we test the variance of the estimators to see which one has lower variance. Table 2 shows the average variance of the coordinates in the LCC estimator and the LUS estimator. We observe that the LUS estimator always achieves lower variance compared to that of the LCC estimator. These results demonstrate that the LUS method is not only theoretically better justified but also more effective than the LCC meethod in practice (for the case of
Table 2: Average variance of the coordinates in the estimators from LCC and LUS.
Figure 7: ) in Web Spam data.
This paper introduced a general subsampling method for solving large-scale multi-class logistic regression problems. We investigated the asymptotic variance of the proposed estimator when the model is correctly specified. Based on the theoretical analysis, we proposed an effective sampling strategy called Local Uncertainty Sampling to achieve any given level of desired variance. We proved that the method always achieves lower variance than random subsampling for a given expected sample size, and the improvement may be significant under the favorable condition of strong conditional imbalance. Therefore the method can effectively accelerate the computation of large-scale multi-class logistic regression in practice. Empirically, we will need a pilot estimate of the probability to setup the acceptance probability. We proved that the variance of the proposed estimator remains a constant which is independent of the randomness of this pilot estimate as long as it is consistent.
We also studied the case of model misspecification. We showed that for binary classification problems (K = 2), the proposed method can generate a consistent estimator (to the best estimator of the misspecified model) if the pilot estimator is consistent. For K > 2, the proposed estimator is biased and we also provided analysis to quantify the bias.
The empirical studies support the theory and demonstrate that the local uncertainty sampling method outperforms the uniform sampling, case-control sampling and the local case-control sampling methods under various settings. By using the proposed method, we are able to select a very small subset of the original data to achieve the same performance as that of the full dataset, which provides an effective mean for big data computation under limited resources.
This work suggests several future directions. First, as we have mentioned at the end of Section 6.2, one can iteratively apply an obtained LUS estimator as the pilot estimator for the next round of fitting the model. This is closely related to the boosting method [26] and a deep discussion on this relationship would be of great interest. Second, considering a situation of online learning with limited budget, the LUS method would likely provide an effective sampling strategy for this problem. Moreover, in high-dimensional settings, where sparse models, e.g., Lasso and Group Lasso, are widely adopted, it would be interesting to extend the LUS method to deal with regularized multi-class logistic regression with special considerations on high-dimensional asymptotic regime.
[1] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Algorithmic Learning Theory, pages 38–53. Springer, 2008.
[2] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems, pages 442–450, 2010.
[3] Paramveer Dhillon, Yichao Lu, Dean P Foster, and Lyle Ungar. New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems, pages 360–368, 2013.
[4] Paul Mineiro and Nikos Karampatziakis. Loss-proportional subsampling for subsequent erm. arXiv preprint arXiv:1306.1840, 2013.
[5] Yu Xie and Charles F Manski. The logit model and response-based samples. Sociological Methods and Research, 17(3):283–302, 1989.
[6] Bianca Zadrozny. Learning and evaluating classifiers under sample selection bias. In Proceedings of the International Conference on Machine Learning, page 114. ACM, 2004.
[7] Tong Zhang and F Oles. The value of unlabeled data for classification problems. In Proceedings of the International Conference on Machine Learning, pages 1191–1198. Citeseer, 2000.
[8] Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1):1–6, 2004.
[9] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
[10] Naoki Abe, Bianca Zadrozny, and John Langford. An iterative method for multi-class costsensitive learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 3–11. ACM, 2004.
[11] Hyun-Chul Kim, Shaoning Pang, Hong-Mo Je, Daijin Kim, and Sung Yang Bang. Pattern classification using support vector machine ensemble. In Proceedings of the International Conference on Pattern Recognition, volume 2, pages 160–163. IEEE, 2002.
[12] Achmad Widodo and Bo-Suk Yang. Support vector machine in machine condition monitoring and fault diagnosis. Mechanical Systems and Signal Processing, 21(6):2560–2574, 2007.
[13] Aik Choon Tan, David Gilbert, and Yves Deville. Multi-class protein fold classification using a new ensemble machine learning approach. Genome Informatics, 14:206–217, 2003.
[14] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[15] William Fithian and Trevor Hastie. Local case-control sampling: efficient subsampling in imbalanced data sets. Annals of Statistics, 42(5):1693, 2014.
[16] Steve Webb, James Caverlee, and Calton Pu. Introducing the webb spam corpus: using email spam to identify web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam, 2006.
[17] Nathan Mantel and William Haenszel. Statistical aspects of the analysis of data from retrospective studies. Journal of the National Cancer Institute, 22(4):719–748, 1959.
[18] Norman Breslow. Design and analysis of case-control studies. Annual Review of Public Health, 3(1):29–54, 1982.
[19] James A Anderson. Separate sample logistic discrimination. Biometrika, 59(1):19–35, 1972.
[20] Gary King and Langche Zeng. Logistic regression in rare events data. Political Analysis, 9(2):137–163, 2001.
[21] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
[22] Alastair J Scott and CJ Wild. Fitting logistic models under case-control or choice based sampling. Journal of the Royal Statistical Society. Series B (Methodological), pages 170–182, 1986.
[23] AJ Scott and CJ Wild. Fitting logistic regression models in stratified case-control studies. Biometrics, pages 497–510, 1991.
[24] Alastair Scott and Chris Wild. On the robustness of weighted methods for fitting models to case–control data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(2):207–219, 2002.
[25] Christopher G Atkeson, Andrew W Moore, and Stefan Schaal. Locally weighted learning for control. In Lazy Learning, pages 75–113. Springer, 1997.
[26] Jerome Friedman, Trevor Hastie, Robert Tibshirani, et al. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics, 28(2):337–407, 2000.
[27] Andrea Vedaldi and Karel Lenc. Matconvnet: Convolutional neural networks for matlab. In Proceedings of the Annual ACM Conference on Multimedia Conference, pages 689–692. ACM, 2015.
We first state a lemma that will be useful for the proofs.
Lemma A.1. For some norm defined on the parameter space of
, assume that the quantities
are bounded. Then, for any compact set
Proof. For a fixed
Then, we have and
. By the Law of
Large Numbers, we know that converges point-wisely to
in probability. According to the assumption, there exists a constant M > 0 such that
Given any , we may find a finite cover
so that for any
, there exists
is finite, as
converges to 0 in probability. Therefore as
, with probability 1, we have
Let , we obtain the first bound. The second and the third bounds can be obtained similarly.
A.1 Proof of Theorem 4.1
Proof. For notational simplicity, throughout the proofs we will abbreviate the point-wise functions , respectively. Moreover, let
(1) Define and the convex function
The Bregman divergence of the convex function G(g) is
Since is convex, we have
Given any h(x), if we define
then is the global optimizer of
. It implies that,
The assumption of the theorem implies that the parameter space is compact and we have . Therefore, for any
It follows that given any , there exists
so that
implies that
. Now according to Lemma A.1, given any
, when
, with probability larger than
where for some
. Since
, we have
. By the continuous mapping theorem,
converges to
which will eventually converge to
in probability according to Lemma A.1. By the Slutsky’s Theorem, the limiting distribution of
is given by
Observe that is the sum of n i.i.d. random variables with mean
. The variance of
where
where the last equality is due to based on Eq. (3.2), and
This implies that we can rewrite
Next, we derive an explicit formula for
where
and
This indicates that . Hence, we have
This proves the desired result.
A.2 Proof of Theorem 4.2
Proof. Observe that the quantity is independent of the sampling probability a(x, k), and the dependence on a(x, k) in Eq. (4.1) comes from S. Therefore, to prove the desired bound of the variance, we only need to show
In order to prove this inequality, we only need to verify that for any vector
where
(1) Consider the first case in Theorem 4.2, where . Given x, we consider the following three sub-cases.
(1.1) Assume that . Denote the left side of Eq. (A.1) by F(a), and plug Eq. (4.5) into F(a), we obtain
where the inequality is obtained by the Cauchy-Schwartz inequality, which implies
Again, the above inequality is obtained by the Cauchy-Schwartz inequality.
(1.3) Assume that for all k = 1, . . . , K and hence q = 0.5. Under this case, we can
immediately obtain F(a) = 0. Combing (1.1), (1.2) and (1.3), we have shown that under the first case, where
(2) Consider the second case in Theorem 4.2, where . Given data point x, we consider the following three sub-cases.
(2.1) Assume that . By plugging Eq. (4.6) into F(a), we have
, and
(2.2) Assume that there exists some such that
. By plugging Eq. (4.6) intoF(a), we have
(2.3) Assume that for all k = 1, . . . , K and hence q = 0.5. Under this case, we can derive with the same way as above and obtain
Combing (2.1), (2.2) and (2.3), we have shown under the second case
The first case and the second case together imply that Eq. (A.1) holds, which proves
This leads to the desired variance bound in the theorem.
We now study the expected sample size.
(1) Consider the first case in Theorem 4.2, where : the conditional expectation of a(x, y) given x is
Therefore, the point-wise conditional expectation of the acceptance probability satisfies (2) Consider the second case in Theorem 4.2, where
: the conditional expectation of
Note that
Therefore, the point-wise conditional expectation of the acceptance probability satisfies
Combing both (1) and (2), we know that the expected number of accepted examples is
This completes the proof of Theorem 4.2.
A.3 Proof of Corollary 4.1
Proof. The proof follows the structure in proof of Theorem 4.1. By the mean value theorem, we have
where for some
. Since
, we have
. From the condition in the theorem,
. Therefore, by the continuous mapping theorem,
converges to
. Moreover, by Lemma A.1,
con- verges to
as
. Finally, by the Slutsky’s Theorem, the limiting distribution of
is given by
Following the proof of Theorem 4.1, we have
which is a constant that is independent of the pilot estimator
A.4 Proof of Proposition 4.1
Proof. The goal is to show that is the maximizer of
, the label
and Eqs. (2.3) and (3.3) can be rewritten as
where we write to indicate that the acceptance probability is computed based on the pilot estimator
. The first order and second order derivatives of
with respect to
Suppose . For the choice of
either in Eq. (4.5) or Eq. (4.6), we have
where the third equality holds by plugging in Eq. (A.4) and the definition of the acceptance probability in either Eq. (4.5) or Eq. (4.6). Since is the best estimator of the original MLE problem in Eq. (A.2),
. Therefore,
is the maximizer of
. The case of
can be derived similarly because of the symmetry. Given
is strictly concave on
because of the expectation over the population. With the assumption that
. Therefore,
is the unique maximizer of
It follows that given any , there exists
implies that
. Now according to Lemma A.1, given any
, with probability larger than
This implies that . So, the empirical estimator
converges to
A.5 Proof of Proposition 4.2
Proof. By following the proofs of Proposition 3 and Theorem 5 in [15], we can show that when
. In this proof, we focus on the analysis of the asymptotic variance. Let
where is some convex combination of
. Rearranging the equation, we obtain
From the condition in the proposition, and
from Proposition 4.1. Moreover,
and hence
. By the continuous mapping theorem,
converges to
which will eventually converge to
as
. By the Slutsky’s Theorem, the limiting distribution of
Define the cross partial derivative
By viewing as a function of
, we apply the Implicit Function Theorem to the fact
0 such that the implicit derivative
can be written as
Then, by using the Taylor expansion and noting that
for some norm . Rearranging the terms and as
We reach the desired result by combining the above with Eq. (A.5).
A.6 Proof of Proposition 4.3
Proof. The population version with respect to A is
Now, given x, suppose there exists a majority class i such that . Then, if the acceptance probability
is computed from a pilot that is
, for choice of
in Eq. (4.5), we have
where is the likelihood defined over the original population
is the best estimator of
under misspecification, the above quantity is 0. For choice of
in Eq. (4.6), we can have the same result. That is, the partial derivative of
with respect to the majority class i is always 0. Now, for class
, with choice of Eq. (4.5), we have
Generally, the partial derivative with respect to any non-majority class j with is not 0, which implies that the overall
is not the maximizer of
even if we choose
. So, the LUS estimator is not consistent to
for K > 2. Similar results can be derived for the choice of Eq. (4.6).
We have the following propositions quantifying the bias of the LUS estimator under model misspeci-fication.
Proposition B.1. Consider a misspecified model with K > 2. Let . With the following assumptions as considered similarly in the case of correctly specified model:
• the pilot estimator
and
are bounded by constant M for some norm
and k =
• there exists positive scalars
Then, if is sufficiently small, then
Proof. Let with normalized non-zero vector
. Then, by the second order mean value theorem, there exists a
Let and use the notations
and
to represent the probability estimate with
. The elements in the Hessian matrix
take the form
Therefore, we have
is negative-definite, and we have
Next, we show that if . The gradient with respect to the i-th class is
Thus, when , and we have
Proposition B.2. Assume the same conditions in Proposition B.1, with norm . As
where and
are constants, and
Proof. By the mean value theorem, there exists some linear combination for
where the second equality holds since is the maximizer of
. Suppose there exists a
that
as
. According to Lemma A.1 and the continuous mapping theorem,
. Then, by using Eq. (B.1),
We now derive the following upper bound:
With some abuse of notation, we will denote tively, and use the notations
to represent the acceptance probability and the probability estimate computed from
. The elements in the matrix
take the form
for any . Then, by the mean value theorem we can bound
and
for any j. Then, we have the following
Using a similar argument, we can have well. Therefore, we verify Eq. (B.2). Now, by using similar technique as in the proof of Proposition B.1, we can write
. Then, by the Woodbury matrix identity, we obtain
from which we can obtain the desired bound.
Proposition B.2 shows that the bias relies on which is the difference between the misspecified distribution
and the true distribution p. Given the misspecified model,
would be fixed and hence the bias of the LUS estimator should not be neglected. Fortunately, in the case that
to p, i.e., we can have a sufficiently small
, the difference between
and
is also bounded in Proposition B.2.
We adopt one of the state-of-the-art convolutional neural net structures provided in the MatConvNet library [27] to study the performance of different sampling methods. The net structures for obtaining the rough estimate and the LUS/US sampling based estimator are shown in Figs. 8(a) and 8(b), respectively, where the net structure for obtaining the rough estimate is much simpler than the one used for LUS/US based estimator to save computation. Moreover, for the simpler neural net, we set the training epoch, i.e., the effective passes over data, to be 10. For the net used for the sampling based estimators, we set the epoch to be a larger value 20.
Figure 8: Deep convolutional neural network structures used in the MNIST data.
We provide simulation results to show how the performance of the LUS estimator changes when we vary the amount of data used for computing the pilot estimate. We keep the settings as the same as used in Section 6.1 and Section 6.2. We change the portion of data used for the pilot estimate as
Figure 9: Performance of the LUS estimator when varying the amount of data for computing the pilot estimate under marginal imbalance.
Figure 10: Performance of the LUS estimator when varying the amount of data for computing the pilot estimate under marginal balance.
5%, 10% and 20%, and evaluate the LUS method.
Fig. 9 and Fig. 10 show the performance change of the LUS estimator in the simulations with marginal imbalance and marginal balance, respectively. In both of the simulations, we observe that the performance curves of the LUS method are similar that varying the amount of data for the pilot estimate does not affect the performance of the LUS estimator much and it generally needs a proportion of to achieve the same performance as that of the full-sample based MLE; when we increase
, the uniform sampling and case-control sampling methods also sample more data points and the superiority of the LUS method is not as obvious as the settings with fewer data for the pilot estimate. The results suggest that we may use a small number of data points (e.g., 5%) to obtain a rough pilot estimate, which is sufficient to let the LUS method perform well.