In this paper we study the problem of learning prediction models, such as regression and classification models, from incomplete data with missing values. This problem arises in a variety of fields ranging from traditional statistical analysis of empirical data in biomedical and social sciences to modern machine learning tasks such as natural language processing and web analytics.
Many statistical and machine learning approaches for handling missing values have been developed and explored in the literature [1,2]. The simplest way of dealing with missing values is to discard them. The obvious drawback of this naive approach is that the size of the training set decreases. Furthermore, discarding instances with missing values leads to serious biases in the prediction results unless certain simplified assumptions about the missing mechanism such
as the missing at random (MAR) assumption [2] are known to be met, which is
rarely the case in practice. Another common approach to handling missing values is single imputation, where missing values are simply populated with specific fixed values prior to prediction modeling. A variety of single imputation methods ranging from simple mean value imputation to advanced low-rank matrix
Figure 1: Schematic diagram of prediction uncertainty evaluation. (a) Given a training set with missing values, the goal is to evaluate the uncertainty in a prediction result. (b) MI constructs multiple (T > 1) training sets by randomly imputing values based on a probabilistic model, and combines the prediction results of T trained models. (c) The IPUB method represents the uncertainty in the missing entries as intervals, and estimates the prediction uncertainty by efficiently computing the lower and upper bounds of the prediction result.
completion [3–7] have been studied in the literature.
A principal concern with these commonly used approaches is that they cannot adequately evaluate the prediction uncertainty stemming from the uncertainty due to missing entries. The multiple imputation (MI) approach has been developed for evaluating prediction uncertainty [8–15]. The basic idea of MI is to introduce probabilistic models of true values in the missing entries. Using probabilistic models, multiple (T > 1) sets of values are sampled at random, and T complete training sets are synthetically constructed by imputing these values to the missing entries. Then, T prediction models are separately trained using these T training sets. The prediction uncertainty is evaluated by combining the prediction results of these T models. In MI, the performance of prediction uncertainty evaluations crucially depends on the choice of probabilistic models. Because the mechanisms of missing values are generally unknown, it is often difficult to select appropriate probabilistic models. Another difficulty with MI is that T must be sufficiently large to adequately evaluate the prediction uncertainty. When the size of the training set is large, the computational cost of training multiple model can be problematic.
In this paper, we propose a novel approach, called the Interval-based Prediction Uncertainty Bounding (IPUB) method, for evaluating the prediction uncertainty for learning prediction models such as regression or classification models with missing values. The IPUB method represents the uncertainty in the true value of a missing entry as an interval. For example, if there is a missing entry for the age of a person, the IPUB method represents the uncertainty in the missing entry, e.g., as “the missing age is in the interval [20, 40]”. Then, for a given test input, the IPUB method also represents the uncertainty in the prediction result in the form of an interval. For example, if the model is trained to predict the income of a person, IPUB evaluates the prediction uncertainty,
The IPUB method can be applied to a class of linear prediction models whose learning algorithm can be formulated as a certain form of penalized empirical risk minimization problem (see (1)) . This class contains many popular prediction model learning algorithms (see Table 1). The IPUB method has two advantages. The first is that it does not rely on probabilistic models of the true values of the missing entries. To evaluate the prediction uncertainty, the IPUB method computes lower and upper bounds for the predicted value when all possible training sets constructed by imputing arbitrary values in the intervals are considered. The second advantage is that, unlike MI methods, the IPUB method does not require multiple model training computations. Specifi-cally, using the IPUB method, a gauge of prediction uncertainty in the form of an interval can be computed only with the cost of training a single prediction model and O(M) additional computations, where M is the number of missing entries in the training set. Figure 1 illustrates the difference between the MI approach and the proposed IPUB method.
1.1 Related works and our contributions
The most popular MI approach is Bayesian MI [8–12]. In Bayesian MI, using observed entries in a training set, probabilistic models of the true values of the missing entries in the form of ) are constructed, where
is the set of unobserved values in the missing entries and
is the set of observed values in the training set. Bayesian MI methods proceed by generating multiple (T > 1) imputed training sets through randomly sampling
from
). Then, using each of the T training sets, T prediction models are trained, and the prediction uncertainty for a test input is modeled by combining the T prediction results of these T models.
The main contribution of this paper is to introduce a sensitivity analysis framework for evaluating prediction uncertainty. Sensitivity analysis has been used for analyzing the stability of solutions to optimization problems when some parameters are perturbed [16, 17]. Our basic idea is to compute how the prediction model (which is the optimal solution of a convex optimization problem) changes when the true values of the missing entries change. We developed a novel sensitivity analysis technique for efficiently computing the interval of prediction results when the true values in the missing entries change within the intervals. Our sensitivity analysis framework is motivated by the recent development of convex optimization techniques for sparse modeling [18–32].
In the numerical analysis literature, so-called interval analysis (IA) [33] has been studied, where the main goal is to evaluate the rounding error of floating point operations. In IA, the input and output for every operation is an interval. Therefore, some of the existing IA methods can be used similarly to the IPUB method for prediction uncertainty evaluation. However, the output intervals for IA tend to be much wider than required because IA computes intervals for every operation, i.e., when the number of operations is large, the final intervals are much wider than the optimal ones. In Section 4, we compare the IPUB method with an extant IA method [34], and demonstrate that the former can provide tighter bounds than the latter in most cases. In addition, the IPUB method is much faster than the IA method because interval computations must be repeated many times in the latter.
Notations For any natural number k, we define [k] := {1, . . ., k}. The -norm of
is denoted as
:= [
]
. For two vectors
, inequality notation such as
indicates elementwise inequalities,i.e.
. In addition, interval notation such as [v, u] indicates a set of elementwise intervals
. We also use these notations for matrices. For a condition c and two statements
and
, the notation (
) indicates “if c is true then
, otherwise
”. For a convex function
,
) := sup
is called the convex conjugate of
.
In this section we configure the problem and provide technical preliminaries for convex analysis.
2.1 Problem Setup
Consider learning a linear prediction model f that maps a d-dimensional input vector to a scalar output
where X and Y are the input
and output domain, respectively. The prediction model is written as
where is a monotonically non-decreasing function that maps the linear model output in R to the output in Y, and
is the vector of linear model parameters. For example, Y = R and g is the identity function in the case of least-squares regression,
and g is the sign function in the case of SVM, and Y = [0, 1] and g is the sigmoid function in the case of logistic regression
.
Let (X, y) be the training set, where X is an n-by-d input matrix and y is an n-dimensional output vector. The i-th row, the j-th column and the (i, j)-th element of X are respectively written as and
, and the i-th element of y is written as
for
] and
]. In this paper, we consider a situation where some of the elements in X are missing. The set of missing elements is denoted as
, and its size is written as M = |M|. We denote the true input matrix as
, and its i-th row, j-th column, (i, j)-th element are similarly written as
, respectively. Note that we cannot actually observe
, while the remaining elements of
are the same as the corresponding elements of X, i.e.,
=
if (
.
For now, consider a hypothetical situation where we can observe . We use a class of learning algorithms formulated as a penalized empirical risk min-
imization problem in the form of
where is a convex loss function, and
is a convex penalty function. The class (1) contains many well-known learning algorithms.
When the training input matrix X contains missing entries, we cannot obtain the optimal model parameters . A commonly taken heuristic approach for circumventing this difficulty is to construct a synthetic input matrix ˆX either by discarding instances that contain missing entries or by imputing fixed values to the missing entries, and then solve the problem in (1) by using ( ˆX, y) as the training set instead of (
). As discussed in Section 1, this heuristic approach completely ignores the effect of the uncertainty in missing entries.
As noted, in this paper, we represent the uncertainty in a missing entry as
an interval. For a missing entry (, let [
which the true value is located, where
possible value that takes, respectively. We also denote the uncertainty in
missing entries in the entire input matrix X as [X
element (
an arbitrary matrix such that all the elements satisfy
Our goal is to fully incorporate the uncertainty in missing entries for evaluating the uncertainty in prediction results. To this end, let us consider a set of
trained model parameters
Then, for a test input , the uncertainty in the prediction result is also
represented as an interval
where
Namely, the set of parameters in (2) represents the collection of all possible solutions of the penalized empirical risk minimization problems in (1) when the
input matrix is arbitrarily chosen from the interval [X
and
diction result when the uncertainty in missing entries is incorporated.
Unfortunately, it is not possible to obtain in practice because, by defini-tion, it requires us to solve infinitely many penalized empirical risk minimization problems. Our fundamental approach to tackling this difficulty involves devel-
oping a computationally feasible method that can efficiently compute
where )) and
)) are a lower bound of the smallest possible prediction result and an upper bound of the largest possible prediction result, respectively.
If these bounds are sufficiently tight, we can use them for evaluating the uncertainty in the prediction results.
2.2 Technical Preliminaries for Convex Analysis
Here, we present technical preliminaries for convex analysis which are used for constructing the proposed method in the next section. For a comprehensive introduction to convex analysis, see, e.g., [35]. In this subsection, for notational simplicity, we denote the input matrix as X, and assume that there are no missing entries in X.
Let us denote the objective function of the empirical risk minimization prob-
lem in (1) as
The method proposed in the next section can be applied to the problem in (1) when the loss function and the penalty function
satisfy certain conditions.
The dual problem of (1) is written, by using the convex conjugates and
, as
where domdenotes the domain of the function. See, for example, Corollary 31.2.1 in [36] for the details of dual problem derivation. Between the primal optimal solution
and the dual optimal solution
, the following relationships
are known to hold under certain regularity conditions:
The method proposed in the next section can be applied to any convex loss functions such as the squared loss function for regression, the hinge loss function for SVM, and the logistic loss function for logistic regression. See Table 1 for these loss function specifications, their convex conjugates, and their subderivatives.
On the other hand, the penalty function must satisfy the following properties.
-strongly convex decomposable penalty)
-strongly convex
Definition 1 includes the penalty, the elastic net penalty, and some other popular penalties. See Table 1 for these penalty function specifications, their convex conjugates, and their subderivatives.
In this section, we present our main results. Proofs of the theorems and the corollary in this section are all presented in Appendix A.
Given a test input , our goal is to evaluate the prediction uncertainty
in the form of an interval
Table 1: Examples of loss functions and penalty functions
for which the
proposed IPUB method is applicable
*1: [1, replace [
0] with [0
or 1, treat
0 are tuning parameters.
where )) and
)) are the lower bound of the smallest possible prediction result and the upper bound of the largest possible prediction result, respectively. Our approach with the proposed IPUB method involves the following two steps:
step 1. Compute a superset from X
step 2. Compute )) and
)) from W.
As described below, the computational cost of step 1 is the cost of solving a single penalized empirical risk minimization problem in (1) and O(M) additional computations where M is the number of missing entries. Once we compute a superset , the computational cost of step 2 is O(d) for each test input
3.1 Step 1
The pseudo-code for step 1 is presented in Algorithm 1. First, we select an
arbitrary input matrix from the interval, i.e.,
Next, we compute the primal and dual solutions by solving the penalized em- pirical risk minimization problem in (1) by using () as the training set, i.e.,
Then, we compute a superset as
where
The following theorem formally states that W in (11) is actually a superset of in (2).
Theorem 1. Suppose that the penalty function satisfies the properties in
let and
be the primal and dual solutions of the problem in (1) when the input matrix is
as defined in (10a) and (10b), respectively. Then, the set of model parameters W defined in (11) is a superset of
, i.e.,
.
Theorem (1) is based on the strong convexity (Definition 1) of the primal
objective function P. By strong convexity, for any
between two optimal solutions can be bounded from above by the difference between the objective function values
), which can be bounded by ∆. See Lemma 1 in Appendix A for details.
Next, we observe that, given the solution , the cost of computing the superset W in (11) is only O(M). Specifically, in the following theorem we state that ∆ in (12) can be computed with a cost of O(M) if some relevant quantities have already been computed.
Theorem 2. Suppose that the same condition as Theorem 1 holds. Furthermore, assume that the following values have already been computed:
Note that the assumption that the values in (13) have been precomputed is not problematic because the quantities in (13a) must have been computed when solving the optimization problem for computing and
, while the quantities in (13b) to (13e) can be known beforehand because they are merely lists of the positions in the missing entries. In addition, the memory required to store all these values is only O(n + d + M).
3.2 Step 2
Because the superset in step 1 is a sphere in
, it is straightforward to compute a lower bound
)) and an upper bound
)) as in the following corollary.
Corollary 1. Assume that a set of model parameters W is a superset of and is represented as a sphere with a center
and a radius
, then the prediction result
) for a test input
is guaranteed to be within
The cost of computing the prediction interval in (15) is O(d).
We explored the performance of the proposed IPUB method by comparing it with the Interval Newton (INewton) method [34]. We applied the two methods to three datasets taken from the UCI machine learning repository [37], where approximately 90% of each dataset was used for the training set, while the remaining 10% was used as the test set (see Table 2). Detailed setups are presented in Appendix B. For brevity, we only present results for the logistic regression in which we predict the probability of the class label being +1 by using the sigmoid function ) = 1/(1+exp(
)). We compared the performance of the two methods in terms of prediction uncertainty interval length and computation time. The former is evaluated as the difference between the upper bound and the lower bound of the predicted probability, i.e.,
)). All the computations were carried out with an Intel Xeon CPU (3.10GHz clock) and 256GB RAM.
4.1 Settings
Experimental Setting Parameters We considered various experimental settings by adjusting three parameters. The first parameter is the missingvalue proportion 001}; we randomly chose
elements of the input matrix and regard them as missing entries. The second parameter is the missing value interval length
; we set the interval for each missing element to [(1
2-quantile, (1 +
2-quantile] of the original values in the element. The third parameter is the penalty parameter
.
Table 2: Datasets used for experimentation. All are taken from the UCI Machine Learning Repository [37].
Implementation of IPUB In the implementation of Algorithm 1, is
chosen as (X
implemented in LIBLINEAR [38], and is computed from
by (8).
Interval Newton (INewton) Method We compared the performances of
the proposed IPUB method with an interval analysis method called the Interval
Newton (INewton) method [34]. In the INewton method, every operation in each Newton step is conducted for parameters and data represented as intervals. The computational cost of the INewton method is ), where K is the number of Newton iteration steps; the cost is proportional to
due to the matrix inverse operation in each Newton step.
4.2 Results
Prediction Uncertainty Interval Length Figure 2 shows results pertain-
ing to the prediction uncertainty interval length. Each plot contains a green
histogram of prediction uncertainty interval lengths )) by the IPUB method, and a purple histogram of the counterpart prediction uncertainty interval lengths obtained using the INewton method for each test input x. In most cases, the prediction uncertainty interval lengths associated with the IPUB method are much smaller than those for the INewton method. Because both methods guarantee that the predicted values are within the intervals, the
Table 3: The ratio of computation times of the IPUB method to those of the INewton method.
results testify to the superiority of the IPUB method.
Computation Time Table 3 lists the ratios of the IPUB to INewton computation times. In all cases, the IPUB method is more than 100 times faster than the INewton method.
In this paper, we proposed a new method for evaluating prediction uncertainty for learning with missing values. The proposed IPUB method is developed based on a novel sensitivity analysis technique for convex optimization problems, and can be applied to a wide class of commonly used machine learning algorithms. The IPUB method is associated with less prediction uncertainty and lower computational cost than existing methods.
This work was partially supported by MEXT KAKENHI (17H00758, 16H06538), JST CREST (JPMJCR1302, JPMJCR1502), RIKEN Center for Advanced Intelligence Project, and JST support program for starting up innovation-hub on
Figure 2: Normalized histograms of prediction uncertainty interval lengths obtained using the IPUB method (green) and the INewton method (purple) for test inputs in three datasets. Since both methods guarantee that the prediction values are within the intervals, shorter interval lengths indicate better performance.
materials research by information integration initiative.
[1] J. L. Schafer, Analysis of incomplete multivariate data. CRC press, 1997.
[2] R. J. Little and D. B. Rubin, Statistical Analysis with Missing Data. Wiley, second ed., 2002.
[3] E. J. Cand`es and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Computational mathematics, vol. 9, no. 6, p. 717, 2009.
[4] E. J. Cand`es and T. Tao, “The power of convex relaxation: Near-optimal matrix completion,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010.
[5] R. H. Keshavan, A. Montanari, and S. Oh, “Matrix completion from a few entries,” IEEE Transactions on Information Theory, vol. 56, no. 6, pp. 2980–2998, 2010.
[6] J.-F. Cai, E. J. Cand`es, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[7] R. Mazumder, T. Hastie, and R. Tibshirani, “Spectral regularization algo- rithms for learning large incomplete matrices,” Journal of machine learning research, vol. 11, no. Aug, pp. 2287–2322, 2010.
[8] D. B. Rubin, “Multiple imputations in sample surveys – a phenomenological bayesian approach to nonresponse,” in Proceedings of the survey research methods section of the American Statistical Association, vol. 1, pp. 20–34, 1978.
[9] M. A. Tanner and W. H. Wong, “The calculation of posterior distributions by data augmentation,” Journal of the American statistical Association, vol. 82, no. 398, pp. 528–540, 1987.
[10] R. J. Little, “Pattern-mixture models for multivariate incomplete data,” Journal of the American Statistical Association, vol. 88, no. 421, pp. 125– 134, 1993.
[11] S. Van Buuren and K. Oudshoorn, “Flexible multivariate imputation by mice,” tech. rep., TNO Prevention Center, 1999.
[12] T. E. Raghunathan, J. M. Lepkowski, J. Van Hoewyk, and P. Solenberger, “A multivariate technique for multiply imputing missing values using a sequence of regression models,” Survey methodology, vol. 27, no. 1, pp. 85– 96, 2001.
[13] G. King, J. Honaker, A. Joseph, and K. Scheve, “Analyzing incomplete political science data: An alternative algorithm for multiple imputation,” American political science review, vol. 95, no. 1, pp. 49–69, 2001.
[14] D. B. Rubin, “Nested multiple imputation of nmes via partially incompat- ible mcmc,” Statistica Neerlandica, vol. 57, no. 1, pp. 3–18, 2003.
[15] S. van Buuren, “Multiple imputation of discrete and continuous data by fully conditional specification,” Statistical methods in medical research, vol. 16, no. 3, pp. 219–242, 2007.
[16] A. Saltelli, K. Chan, E. M. Scott, et al., Sensitivity analysis, vol. 1. Wiley New York, 2000.
[17] D. P. Bertsekas, Nonlinear Programming. Athena Scientific, 1999.
[18] L. El Ghaoui, V. Viallon, and T. Rabbani, “Safe feature elimination for the lasso and sparse supervised learning problems,” Pacific Journal of Optimization, vol. 8, no. 4, pp. 667–698, 2012.
[19] J. Wang, J. Zhou, P. Wonka, and J. Ye, “Lasso screening rules via dual polytope projection,” in Advances in Neural Information Processing Systems, pp. 1070–1078, 2013.
[20] J. Wang, P. Wonka, and J. Ye, “Scaling svm and least absolute deviations via exact data reduction,” Proceedings of The 31st International Conference on Machine Learning, 2014.
[21] K. Ogawa, Y. Suzuki, and I. Takeuchi, “Safe screening of non-support vectors in pathwise svm computation,” in Proceedings of the 30th International Conference on Machine Learning, pp. 1382–1390, 2013.
[22] J. Liu, Z. Zhao, J. Wang, and J. Ye, “Safe Screening with Variational Inequalities and Its Application to Lasso,” in Proceedings of the 31st International Conference on Machine Learning, pp. 289–297, 2014.
[23] J. Wang, J. Zhou, J. Liu, P. Wonka, and J. Ye, “A safe screening rule for sparse logistic regression,” in Advances in Neural Information Processing Systems, pp. 1053–1061, 2014.
[24] Z. J. Xiang, Y. Wang, and P. J. Ramadge, “Screening tests for lasso problems.” arXiv preprint arXiv:1405.4897, 2014.
[25] O. Fercoq, A. Gramfort, and J. Salmon, “Mind the duality gap: safer rules for the lasso,” in Proceedings of the 32nd International Conference on Machine Learning, pp. 333–342, 2015.
[26] E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon, “Gap safe screening rules for sparse multi-task and multi-class models,” in Advances in Neural Information Processing Systems, pp. 811–819, 2015.
[27] J. Zimmert, C. S. de Witt, G. Kerg, and M. Kloft, “Safe screening for support vector machines.” NIPS 2015 Workshop on Optimization in Machine Learning (OPT), 2015.
[28] S. Okumura, Y. Suzuki, and I. Takeuchi, “Quick sensitivity analysis for incremental data modification and its application to leave-one-out cv in linear classification problems,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 885–894, 2015.
[29] A. Shibagaki, Y. Suzuki, M. Karasuyama, and I. Takeuchi, “Regularization path of cross-validation error lower bounds,” in Advances in Neural Information Processing Systems, pp. 1666–1674, 2015.
[30] K. Nakagawa, S. Suzumura, M. Karasuyama, K. Tsuda, and I. Takeuchi, “Safe pattern pruning: An efficient approach for predictive pattern mining,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1785–1794, ACM, 2016.
[31] A. Shibagaki, M. Karasuyama, K. Hatano, and I. Takeuchi, “Simultaneous safe screening of features and samples in doubly sparse modeling,” in International Conference on Machine Learning, pp. 1577–1586, 2016.
[32] H. Hanada, A. Shibagaki, J. Sakuma, and I. Takeuchi, “Efficiently evaluat- ing small data modification effect for large-scale classification in changing environment.” Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI-18), 2018.
[33] R. E. Moore, R. B. Kearfott, and M. J. Cloud, Introduction to Interval Analysis. Cambridge University Press, 2009.
[34] E. Hansen and G. W. Walster, Global optimization using interval analysis: revised and expanded. CRC Press, 2003.
[35] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[36] R. T. Rockafellar, Convex analysis. Princeton university press, 1970.
[37] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007.
[38] R. R. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin, “LI- BLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, 2008.
We first present a lemma for strongly convex functions that can be used for finding a set of models when the data values are changed in the interval. These proofs are mainly based on convex optimization theory.
Lemma 1. Let f(v): : a convex set) be
-strongly convex. Let
:= argmin
Proof. The lemma is proved from the fact that 0 holds for any subdifferentiable convex function
for any convex domain S (see
B.24 in [17] for the proof). Then we have
argmin
we have
Therefore we can build a superset of as
Thus we know that W in (11) is a superset of . Note that the relationship between (16) and (17) can be proved from the following fact:
Proof of Theorem 2. In the setting of the theorem, the following (18) and (19) hold. Then we have (14) by ) =
) (
(9)) and Theorem 1.
We only present the proof of (18) here; the counterpart proof for (19) can
be similarly derived.
The expression (20) is rewritten as (21) (fewer terms in the summation) because,
if , then
The first term in (21) can be computed as that in (22) since ) is a onevariable convex function with respect to t; it is maximized at either end of the
domain of t, that is,
The fact that = min
proved.
The expression (23) is rewritten as (24) following a similar approach to (20) and (21) above.
Proof of Corollary 1. Because we assume g is monotonically non-decreasing, we
only have to prove
(25) is proved as follows. (26) is similarly proved.
Because the second term in the last expression is an inner product, it is minimized when and u is directed opposite to x, that is, u =
How to prepare datasets in Section 4 is as follows:
Separation of datasets into training and test sets For each of default
Outlier removal For each diversed numerical feature j, let be its 0.5-
Normalization For each feature, we normalized the values via a linear trans-