Lipschitz standardization for robust multivariate learning

Probabilistic learning is increasingly being tackled as an optimization problem, with gradient-based approaches as predominant methods. When modelling multivariate likelihoods, a usual but undesirable outcome is that the learned model fits only a subset of the observed variables, overlooking the rest. In this work, we study this problem through the lens of multitask learning (MTL), where similar effects have been broadly studied. While MTL solutions do not directly apply in the probabilistic setting—as they cannot handle the likelihood constraints—we show that similar ideas may be leveraged during data preprocessing. First, we show that data standardization often helps under common continuous likelihoods, but it is not enough in the general case, specially under mixed continuous and discrete likelihood models. In order for balance multivariate learning, we then propose a novel data preprocessing, Lipschitz standardization, which balances the local Lipschitz smoothness across variables. Our experiments on real-world datasets show that Lipschitz standardization leads to more accurate multivariate models than the ones learned using existing data preprocessing techniques. The models and datasets employed in the experiments can be found in https://github.com/adrianjav/lipschitz-standardization.


Figure 1: Marginals of continuous (left) and discrete (right) variables from the Adult dataset obtained from a trained VAE. Top to bottom: ground-truth, actual data; std, continuous variables were standardized; std-all, everything was standardized after replacing the discrete distributions by continuous approximations.

In the past few years gradient-based optimization approaches are becoming the gold standard for probabilistic learning. Representative examples of this trend include black box variational inference (BBVI) (Ran- ganath et al., 2014) and Variational Autoencoders (VAE) (Diederik et al., 2014). However, when such methods are applied to real-world datasets, one often encounters issues such as numerical instabilities.

As an illustrative example, we learn a VAE on the Adult

dataset from the UCI repository (Dua and Graff, 2017),

where every observation is represented by a set of twelve

mixed continuous and discrete variables, with heteroge-

neous data distributions (see Figure 1). As it is a common practice, we prevent numerical issues by standardizing the continuous variables prior to training the model. However, as shown in Figure 1, while the learned model does a reasonable job at fitting the continuous variable Final weight, it results in a poor fit of the discrete variable Occupation. Since discrete data seem cumbersome to work with, we then rely on a continuous approximation of these variables and standardize every variable to learn the VAE. Once the VAE is learned, we use the learned parameters to recover the parameters of the discrete likelihoods. In this case, illustrated in the bottom row of Figure 1, the VAE does a better job at capturing the Occupation but at the price of a poor fitting of the Final weight.

In order to understand the source of this issue, we need to dive deeper into the problem formulation. In short, the objective function of the VAE can be written as the sum of per-variable losses, i.e.,  L = �d Ld, and thus be interpreted as a multitask learning (MTL) problem–where different tasks (variables, in our case) compete for the model parameters during learning. In this context, previous work has shown that disparities in the gradient magnitudes across tasks, may determine which tasks the model prioritizes during training (Ruder, 2017). Due to the more restrictive nature of probabilistic learning, however, extant solutions from the MTL literature—e.g., GradNorm (Chen et al., 2018)—do not directly apply, as the likelihood would not integrate to one anymore.

Figure 2: Same setting as in Figure 1 where now ground-truth is the actual data and lip-all refers to the variable fit-tings obtained after preprocessing with Lipschitz standardization, fitting well every variable. In this paper, we rely on BBVI as showcase of gradient-based probabilistic learning to show that the solution resides in the data itself. Specifically, in Section 2.2, we first formalize the concept of balanced multivariate learning, which aims to ease that all the observed variables are learned at the same rate, and thus no variable is overlooked. In this context, we are able to study why data standardization often helps towards balanced learning when applied to common continuous likelihood

functions, such as the Gaussian distribution (Section 3). Unfortunately, as shown in our example above, this is not always the case. Then, based on our analysis, we propose Lipschitz standardization (Section 4), a novel preprocessing method that reshapes the data to equalize the local Lipschitz smoothness of the log-likelihood functions across all continuous and discrete variables. As illustrated in Figure 2, Lipschitz standardization facilitates a more accurate fitting by balancing learning across all variables.

Finally, we test Lipschitz standardization prior to learning different probabilistic models (mixture models, probabilistic matrix factorization, and VAEs) on six real-world datasets (see Section 5). Our results show the effectiveness of the proposed method which leads to a more balanced learning across dimensions, greatly improving the final performance across dimensions on most settings, being in the worst case as good as the best of the considered baseline preprocessing methods, including data standardization.

Let us assume a set of N observations  X = {xn}Nn=1, each with D different features  xn = {xnd}Dd=1. Follow- ing Hoffman et al. (2013), we consider that the joint distribution over the observed variables X, local latent variables  Z = {zn}Nn=1, and global latent variables  β, is given by the fairly simple—yet general—latent variable model p(X, Z, β) = p(β) �Nn=1 p(xn|zn, β)p(zn). To account for mixed likelihood models, we further assume that the likelihood factorizes per dimension as


where  ηnddenotes the likelihood parameters given by the latent variables  znand  βfor each  xnd, ηnd(zn, β).

Furthermore, we rely on BBVI (Ranganath et al., 2014) to approximate the posterior distribution over the latent variables,  p(Z, β|X). For simplicity in exposition, we assume a mean-field variational distribution family of the form  q(Z, β) = qγβ(β) �Nn=1 qγn(zn), where  {γn}Nn=1and  γβare respectively the local and global variational parameters. We denote by  γthe set of all variational parameters. BBVI relies on (stochastic) gradient ascent to find the parameters that maximize the evidence lower bound (ELBO),1 i.e.,


BBVI performs iterative updates over the variational (global and local) parameters of the form γt = γt−1 + α∇γL(X, γ, ϕ)where t is the current step in the optimization procedure. We further assume that the reparametrization trick (Diederik et al., 2014) can be applied on the latent variables (i.e.,  Z, β = f(γ, ε), where  εis a noise variable), such that the gradient of Eq. 2 can be computed as:


where we denote the log-likelihood  log pd(xd; ηd(γ))by  ℓd(ηd(γ)), making explicit the dependency of the log-likelihood evaluation to the variational parameters  γthrough the likelihood parameters  ηwhile making implicit its dependency with  xdand  ε.

A closer look to Eq. 3 shows that each dimension in the data contributes to the overall gradient computation in an additive way. Therefore, the gradient evaluation with respect to the shared parameters—and in consequence the learning process—can be monopolized by a small subset of dimensions if their gradients dominate this sum in Eq. 3. In other words, while the objective is to capture the joint distribution of all dimensions, differences in the gradient evaluation across different observed variables (e.g., Gaussian vs. multinomial) may result in a latent variable model that poorly fits a subset of the observed dimensions, as we already observed in the example of Section 1.

2.1 Connections with multitask learning

The gradient computation in Eq. 3—and the undesirable scenario described in the above—may result familiar to those readers knowledgeable about MTL literature. In MTL it is common to have a set of shared parameters  γwhose gradient are of the form  ∇γL = �d ∇γLd, where the sum is taken over all the tasks and each  Ldis the loss function of a particular task. When great disparities exist between task gradients during learning, the resulting model may poorly perform on some tasks, an effect attributed to the competition between tasks for the shared parameters and known as negative transfer (Ruder, 2017). Hence, the (variational) inference problem stated in Eq. 2 may also be interpreted as a (more restrictive) MTL problem where the input variables play the role of tasks, and the inference parameters are shared.

Given a set of fixed tasks, the most common approach in MTL is to tackle the previous problem using adaptive solutions (Chen et al., 2018; Kendall et al., 2018; Guo et al., 2018). These solutions add a set of weights to the loss function,  L = �d ωdLd, and dynamically change their value—based on different criteria—so that the magnitude of each task gradient  ∇γLdis comparable to the ones of other tasks.

Unfortunately, this type of solutions cannot be applied in the probabilistic setting since, as we mentioned before, we face a more restrictive problem. Specifically, by adding this set of weights in Eq. 2, we would also modify the likelihood in Eq. 1, which would no longer integrate to one as required.

2.2 Balanced multivariate learning

In variational inference, or more generally, in approximate Bayesian inference, we aim to accurately capture the posterior distribution of the latent variables explaining the joint distribution over all the observed variables, and not just a subset of them. Ideally, we want to follow a balanced multivariate learning process, where the normalized likelihood improvement per iteration t + 1 is the same for all dimensions, i.e.,


for d = 1, 2, . . . , D, where  γ0denotes the initialization of the variational parameters, and  Ctthe constant improvement at step t for all dimensions.

This is to the best of our knowledge the first time that balanced learning is properly defined, but its relevance has been acknowledge in prior MTL work (e.g., Eq. (6) of Milojkovic et al., 2019). Of special interest is GradNorm (Chen et al., 2018), an adaptive solution whose weights are tuned to “dynamically adjust gradient norms so different tasks train at similar rates”, including  ℓd(ηd(γt+1))/ℓd(ηd(γ0))in their formulation. Unfortunately, Eq. 4 turns out to be an unrealistic goal for the scope of this work.

To find a more feasible objective, we focus on the class of L-smooth functions, which is the broadest class of functions with convergence guarantees in gradient descent. A function  ℓ(γ)is L-smooth on Q with respect to  γ ∈ Qif it is twice-differentiable and, for any  a, b ∈ Q, it holds that:


For such class of functions, there exist theoretical results on the convergence rate to a critical point as a function of the Lipschitz constant L and number of steps T (Nesterov, 2018). Using our notation, this rate can be written as mint=1,2,...,T ||∇γℓd(ηd(γt))|| = O(�L/T). Note that this implies  ||∇γℓd(ηd(γt))|| → 0as  t → ∞, and in turn, ����∇γℓd(ηd(γt+1)) − ∇γℓd(ηd(γt))���� → 0. We can thus replace Eq. 4 by


which instead focuses on the difference between consecutive gradients to be proportionally equal across dimensions. Finally, assuming a good parameter initialization  γ0such that the initial gradient magnitudes are comparable across dimensions, we can consider constant the denominator from Eq. 6 as well. As a result, forcing every dimension to be L-smooth, i.e.,

����∇γℓd(ηd(γt+1)) − ∇γℓd(ηd(γt))���� ≤ L����γt+1 − γt����(7) turns out to be a weaker version of Eq. 6, whose goal is to ease a more balanced multivariate learning process.

In the following section, we study the impact of data standardization on the learning process. To this end, we show the relationship between the Lipschitz constants of the likelihood functions evaluated on the original and the standardized data. We then propose an estimator of the (local) Lipschitz constant, which allows us to show that, while data standardization may help, unfortunately in some cases is may counterproductive for balanced multivariate learning.

Preprocessing methods (e.g., standardization) are widely used in statistics and machine learning. However, there is a priori no way of deciding which one to use (Gnanadesikan et al., 1995; Juszczak et al., 2002; Milligan and Cooper, 1988). In distance-based machine learning methods, e.g. clustering, the effectiveness of these two methods can be readily understood since they bring all the data into a similar range, making the distance between points comparable across dimensions (Aksoy and Haralick, 2001). In other approaches, such as maximum likelihood or variational inference, the distance argument becomes less convincing,2 since explicit distance between observations is no longer evaluated. Another argument is that they usually improve numerical stability by moving the data, and thus the model parameters, to a well-behaved part of the real space. Since computers struggle to work with tiny and large values, this would have an inherent effect in the learning process.

In this section, we study the impact that dimension-wise data preprocessing, specifically scaling transformations of the form  �x = ωx, has on BBVI as an example of Bayesian inference methods based on first order optimization. We choose scaling transformations since: i) they preserve important properties of the data distribution, such as domain and tails; and ii) they are broadly used in practice (Han et al., 2011). Note that as shifting the data,  �x = x − µ, may violate distributional restrictions (e.g., non-negativity), we assume that the data may have been already shifted prior to the likelihood selection. Specifically, our main focus is on three broadly-used data scaling methods:

Standardization:  �xnd = xnd/stdd,Normalization:  �xnd = xnd/maxd,Interquartile range:  �xnd = xnd/iqrd,

where  stdd, maxd, and  iqrddenote the empirical standard deviation, absolute maximum, and interquartile range of the d-th dimension, respectively.

Next, we introduce a novel perspective on the effect of data scaling in inference methods based on first-order optimization. In a similar way as Santurkar et al. (2018) showed that batch normalization (Ioffe and Szegedy, 2015) smooths out the optimization landscape of the loss function, we show that data standardization often smooths out the log-likelihood optimization landscape in a similar way across dimensions. Importantly, by applying the chain rule to the gradient computation, i.e.,  ∇γℓ(η(γ)) = ∇ηℓ(η) · ∇γη, we can focus on the data-dependent part, the likelihood gradient  ∇ηℓ(η).3 In the following, we denote by �ℓd(�ηd) := log pd(�xd; �ηd)the likelihood function (with parameters �ηd) evaluated on the scaled data.

3.1 Scaling the exponential family


where  ηnd(zn, β)denotes the natural parameters parameretized by the latent variables, T(x) the sufficient statistics, h(x) is the base measure, and  A(η)the log-partition function. Note that both  ηand T(x) are vectors of size  Id. Working with the exponential family let us draw one really useful relation between scaled and original data: Proposition 3.1 (Simplified). Let  p(x; η)be a member of the exponential family where  x ∈ Rand  η ∈ RI. Besides, let us define  �x := ωxfor a given  ω ∈ R. Then, if every sufficient statistic can be factorized as  Ti(�x) = fi(ω)Ti(x)+gi(ω), the following holds:

∂j�ηi log p(�x, �η) = fi(ω)j ∂jηi log p(x; η),(9) where  ∂jηiand  ∂j�ηidenote the j-th partial derivative with respect to  ηiand  �ηi := ηi/fi(ω), respectively.

Table 1: First two columns: Multiplicative and additive noise (see Prop. 3.1) for some common distributions (parameterized for simplicity with the canonical parameters, instead of the natural ones). When  fior  giis omitted, it is assumed to be 1 or 0, respectively. Last two columns: L-smoothness of the scaled likelihood (parameterized by  �η1and  �η2) as a function of the original (canonical) likelihood parameters. Rat denotes a rational function, and  ψ(1)the trigamma function.


A more complex version of the proposition and its proof can be found Appendix C. Although the proposition’s requirements may look restrictive at first, as reported in Table 1, many commonly-used distributions fulfil such properties. It also is worth-mentioning that in the case of the log-normal distribution we consider the scaling function  �x = xω, instead of  �x = ωx.

Assume now that  ℓ(η)is  Li-smooth with respect to its i-th natural parameter,  ηi. Using Proposition 3.1, we obtain the Lipschitz constant of the scaled likelihood �ℓd(�ηd)as a function of the original one  ℓ(η), i.e.,


where  �a, �b ∈ RIare two different (scaled) parameters and the last expression is a result of the Cauchy-Schwarz inequality. Assuming the 1-norm, this implies that the scaled log-likelihood �ℓ(�η)is �Li-smooth with respect to  �ηi, with


3.2 “Standardizing” the optimization landscape

In order to quantify the L-smoothness of a function, we need to compute its Lipschitz constant. As we are considering here data scaling transformations, i.e., a preprocessing step, we focus on the local L-smoothness around the empirical estimation of the natural parameters, denoted by  �η. As an example, assuming a Gaussian variable with empirical mean and standard deviation denoted by  �µand  �σ, then  �η1 = �µ/�σ2and  �η2 = −1/2�σ2.

Unfortunately, calculating the (ε-local) Lipschitz constant may be challenging, as it involves solving


where  B(�η, ε)is the ball with radius  εand centered in the empirical estimation of the natural parameters  �η. Instead, we here rely on an estimator of  Li, which is derived by taking the limit  ε → 0and making use of the multivariate mean value theorem.

Theorem 3.1 (Mean Value Theorem). Let  ℓ(η)be a twice-differentiable real-valued function with respect to  ηi ∈ ηon  Q ⊂ RI. Then, for any two values  a, b ∈ Q, there exists  c ∈ Qsuch that


By taking norms above and applying the Cauchy-Schwarz inequality we obtain the same inequality as in Eq. 5, ����∂ηiℓ(a) − ∂ηiℓ(b)���� ≤����∇η∂ηiℓ(c)���� · ||a − b||. Setting  c = �η, we obtain our local estimator of the Lipschitz constant as:


Importantly, if  ℓis  Li-smooth for each  ηiin the set of natural parameters  η, then it is �i Li-smooth with respect to  η. Similarly, if  ℓ1is  L1-smooth and  ℓ2 L2-smooth, then  ℓ1 + ℓ2is  (L1 + L2)-smooth.4 These properties are proved in Appendix B.

Moreover, for the distributions considered in Table 1, we can use our estimator to approximate the resulting L-smoothness after standardizing the data (details in Appendix E). These results shed some light on why standardizing works well in many settings, since it makes the L-smoothness comparable across dimensions for several common likelihood functions. Specifically, i) the exponential and Rayleigh distributions have constant (local) L-smoothness; ii) a centered (log-)normal distribution is 3-smooth; and iii) the Gamma distribution is (approximately) 1-smooth as long as its shape parameter  α(which is scale-invariant, i.e.,  ˜α = α) is sufficiently large. However, Table 1 also showcases that for other likelihood the resulting Lipschitz constants may not be comparable. This is the case for the inverse Gaussian (Gamma) distribution, whose Lipschitz constants after standardizing are rational functions of  µ(of  α) that can be arbitrarily large or small.

In the previous section we observed that the Lipschitz constant after scaling the data, �Li(ω), can be seen as a function of the scaling factor  ω. As a consequence, it should be possible to find an  ωthat eases balanced multivariate learning by making all the dimensions in the data share the same Lipschitz constant. In this section, we propose a novel data scaling algorithm with this same goal in mind, Lipschitz standardization. Intuitively, our algorithm puts the data into a region of the parameter space where the local L-smoothness is comparable across all dimensions.

Given a single L-smooth function  ℓ(γ), it can be shown that there exists an optimal step size  α∗ = 1/Lfor first-order optimization (Nesterov, 2018). However, when we aim to jointly fit multiple functions, in our case log-likelihood functions  {ℓd(ηd(γ))}Dd=1, each one being  Ld-smooth, the optimal learning rate for each individual likelihood is different, although the parameters (in our case, the variational parameters  γ) that we optimize are shared. Importantly, while there exists an optimal learning rate for the overall likelihood function  ℓ(γ) = �d ℓd(ηd(γ)), it may still lead to an unbalanced learning process, and thus, to inaccurate fitting of the data.

The proposed Lipschitz standardization scales each d-th dimension using the weight  ω∗d, obtained such that all dimen- sions share a similar Lipschitz, i.e.,


where �Ldi(ωd)are the scaled Lipschitz constants, as in Eq. 11, and  L∗the target L-smoothness. In our experiments we set  L∗to  1/(Dα), where  αis the initial learning rate set by the practitioner. The motivation behind this choice is approximating the resulting overall likelihood ˜L-smoothness to the one optimal for a given learning rate, being ˜L = �d ˜Ld ≈ �d 1/(Dα) = 1/α.

Remark 1. In our experiments, we use Proposition 3.1 and automatic differentiation to approximate the local L-smoothness, as well as closed-form solutions and root-finding methods to find the optimal scaling factors  ω∗d(details in Appendix D). However, we recall that gradient descent may be also used to solve the optimization problem in Eq. 14. As a result, Lipschitz-standardization is applicable to other log-likelihood functions than the ones discussed above, as well as for different problems beyond BBVI.

Remark 2. Our algorithm is a preprocessing step, and thus the Lipschitz standardized data  �x, as well as the scaled likelihood functions �ℓd(�ηd), are used to learn the model parameters (the variational parameters, in our case). However, during test and deployment, we ought come back to the original space of the data. This can be done, in the case of distributions in the exponential family (see Section 3.1) by using Prop. 3.1, which shows how to obtain the parameters of the original likelihood function as  η = f(ω) ⊙ �η. Appendix A briefly sketches this idea, providing examples on how our approach applies to the distributions in Table 1 and to discrete data, which we discuss next.

4.1 Discrete data

Up to this point, our algorithm only applies to continuous data and likelihood functions. However, real-world data often present mixed continuous and discrete data types, as well as likelihood models. Next, we extend the proposed


Figure 3: Missing imputation error across different datasets and models (lower is better). Each method appears only when applicable and it is shown in the same order as in the legend.

Lipschitz-standardization method to discrete data (represented using the natural numbers), assuming discrete distributions such as Bernoulli, Poisson and categorical distributions. We refer to this new approach as Gamma trick.

Gamma Trick. This approach (detailed in Appendix A) can be summarised in four steps: i) transform the discrete data x to continuous x via additive noise, i.e.,  x = x+ε, for which we assume a Gamma likelihood; ii) apply Lipschitz standardization to x to ease more balanced learning; iii) apply the learning process on the scaled data  �xto learn the model parameters  �η; and iv) estimate the parameters of the original discrete distribution using the learned (un-)scaled continuous distribution.

Recovering the parameters of the discrete likelihood. The Bernoulli and Poisson distributions are characterized by their expected value. Hence, to recover their distributional parameters for testing, it is enough to do mean matching between the original distribution and its (un-scaled) Gamma counterpart. Note that the mean of the discrete variable x is given by  µ = µ − E [ε], where  µis the mean of x, i.e.,  α/βunder the (un-scaled) Gamma distribution with parameters  αand  β. Therefore, we estimate the mean of the Bernoulli distribution as  p = max(0, min(1, µ)), and the rate of the Poisson distribution as  λ = max(δ, µ), where  0 < δ ≪ 1to ensure that  λis positive.

As the categorical distribution has more than one parameter, a Bernoulli trick is applied before applying the Gamma trick. The Bernoulli trick assumes a one-hot representation of the K-dimensional categorical distribution and treat each class as an independent Bernoulli distribution, which as shown above is suitable for the Gamma trick. To recover the parameter of the categorical distribution  π = (π1, π2, . . . , πK)we individually recover the mean of each Bernoulli class,  µk, and make sure that they sum up to one, i.e.,  πk = µk/�Ki=1 µifor k = 1, 2, . . . , K. Note that, when applying Lipschitz standardization to the categorical distribution, we account for the fact that it has been divided in K Gamma distributions. As we want all the observed dimensions to be  L∗-smooth, we group up the new K Gamma distributions and set their objective L-smoothness to  L∗/K, so that they add up to the same L-smoothness, i.e., �kŁk = L∗.

Additive noise. In our transformation from discrete data into continuous data,  x = x+ε, we ensure that the continuous noise variable  ε: i) lies in a non-zero measure subset of the unit interval  ε ∈ (0, 1)so that the original value is identifiable; ii) preserves the original data shape as much as possible; and iii) ensures that the shape parameter  αof the Gamma is far from zero, and  L1does not become arbitrarily large (see Appendix E for further details). In our experiments we use noise  ε ∼ Beta(1.1, 30).

Experimental setup We use six different datasets from the UCI repository (Dua and Graff, 2017) and apply BBVI to fit three generative models: i) mixture model; ii) matrix factorization; and iii) (vanilla) VAE. Additionally, we pick a likelihood for each dimension based on its observable properties (e.g., positive real data or categorical data) and, to provide a fair initialization across all methods and datasets, continuous data is standardized beforehand. Appendix F contains further details and tabular results.

Methods. We consider different combinations of continuous and discrete preprocessing, taking them in our naming nomenclature as prefix and suffix, respectively. Specifically, for continuous variables we use: i) std, standardization; ii) max, normalization; iii) iqr, divides by the interquartile range; iv) lip, Lipschitz standardization. And similarly we consider for discrete distributions: i) none, leaves the discrete data as it is; ii) bern, applies the Bernoulli trick to


Figure 5: Per-dimension normalized error on the Adult dataset. Top row: Matrix factorization. Bottom row: VAE. Note that all methods but lip-gamma overlook a subset of the variables.

categorical data; iii) gamma, applies the Gamma trick to all discrete variables. As an example, the proposed method applies the Gamma trick to the discrete variables, and then Lipschitz standardizes all the data, so that it is denoted as lip-gamma.


Figure 4: Per-dimension normalized error for different models on the Letter dataset. Dotted line represents the baseline. Values closer to the origin are better.

Metric. Analogously to Nazabal et al. (2018), we evaluate the performance of the methods in a data imputation tasks using average missing imputation error as evaluation metric. Specifically, normalized mean squared error is used for numerical variables and error rate for nominal ones. Besides, in Figures 4 and 5, we show the imputation error, normalized by the error obtained by mean imputation, for each dimension.

Results. Figure 3 summarizes the results averaged over three settings with 10, 20, and 50 % of missing values— with 10 independent runs each—where outliers were removed for better visualization (more detailed results can be found in Appendix G). We can distinguish two groups. The first group corresponds to the methods that leave discrete data untouched, where we observe that the Lipschitz standardization (lip-none) provides comparable results to the best of its counterparts (max-none, std-none, iqr-none), being worth-mentioning the results of matrix factorization in defaultCredit, where std-none and iqr-none completely disappear from the plot after removing outliers. Clearly, the second group of methods, which handle discrete variables using either the Bernoulli or Gamma trick, outperform the former group. This becomes particularly clear on highly heterogeneous datasets (e.g., defaultCredit and Adult), where we obtain—and occasionally beat—state-of-the-art results reported by Nazabal et al. (2018).

We remark that, while results across models are consistent, the effect of data preprocessing directly depends on the model capacity and dataset complexity. Specifically, the mixture model is too restrictive, finding the same optimum regardless of the preprocessing; matrix factorization has enough capacity to be greatly affected by the data (as shown in Figure 3); and the VAE is as powerful as to overcome most of the differences in the preprocessing for simpler datasets, yet still being affected for the most complex datasets. This is nicely exemplified in Figure 4, which shows per-dimension normalized error on the Letter dataset, where we clearly observe the benefits of both the Bernoulli and Gamma tricks.

Last but not least, the advantage of using Lipschitz-standardization, i.e. lip-gamma, compared with the other two competitive methods, lip-bern and std-gamma, comes in the form of more consistent results for all datasets and independent runs, due to a more balanced learning. This can be easily seen by analyzing the per-dimension error of the most complex datasets—see Figure 5—where lip-gamma improves the overall imputation error across tasks without completely overlooking any variable. On the other hand, both lip-bern and std-gamma overlook four different variables on the Adult dataset using two different models. This behavior is not exclusive of Adult, as Figures 10-11 and Tables 4-6 in Appendix F show. To tie everything up, we would like to point out that the illustrative example given in Section 1 (Figures 1-2) corresponds to a particular run from the bottom row.

In this work we have introduced the problem of balanced multivariate learning, which occurs when first-order optimization is used to perform approximate inference in multivariate probabilistic models, and which can be seen as a MTL problem. Then, since existing solutions for MTL problems do not seem to directly apply in the probabilistic setting, we have instead focused on data preprocessing as a simple and practical solution to mitigate unbalanced learning. In particular, we have shed new insights on the behaviour of data standardization, finding that it makes the smoothness of common continuous log-likelihoods comparable. Finally, we have proposed Lipschitz standardization, a data preprocessing algorithm that eases balanced multivariate learning by making the local L-smoothness equal across all (discrete and continuous) dimensions of the data. Our experiments show that Lipschitz standardization outperforms existing methods, and specially shines when the data is highly heterogeneous.

Interesting research avenues include the implementation of Lipschitz standardization in probabilistic programming pipelines, its use in settings different from BBVI (e.g., HMC), and extending this idea to an online algorithm embedded in the learning process, which takes the model into consideration and enables the fine-tune of the local Lipschitz during learning.

Aksoy, S. and Haralick, R. M. (2001). Feature normalization and likelihood-based similarity measures for image retrieval. Pattern recognition letters, 22(5):563–582.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877.

Chen, Z., Badrinarayanan, V., Lee, C.-Y., and Rabinovich, A. (2018). Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR.

Diederik, P. K., Welling, M., et al. (2014). Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), volume 1.

Dua, D. and Graff, C. (2017). UCI machine learning repository.

Gnanadesikan, R., Kettenring, J. R., and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis. Journal of Classification, 12(1):113–136.

Guo, M., Haque, A., Huang, D.-A., Yeung, S., and Fei-Fei, L. (2018). Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 270–287.

Han, J., Pei, J., and Kamber, M. (2011). Data mining: concepts and techniques. Elsevier.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Jang, E., Gu, S., and Poole, B. (2016). Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.

Juszczak, P., Tax, D., and Duin, R. P. (2002). Feature scaling in support vector data description. In Proc. asci, pages 95–102. Citeseer.

Kendall, A., Gal, Y., and Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482– 7491.

Milligan, G. W. and Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. Journal of classification, 5(2):181–204.

Milojkovic, N., Antognini, D., Bergamin, G., Faltings, B., and Musat, C. (2019). Multi-gradient descent for multi- objective recommender systems. arXiv preprint arXiv:2001.00846.

Nazabal, A., Olmos, P. M., Ghahramani, Z., and Valera, I. (2018). Handling incomplete heterogeneous data using vaes. arXiv preprint arXiv:1807.03653.

Nesterov, Y. (2018). Lectures on convex optimization, volume 137. Springer.

Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. Artificial Intelligence and Statistics, pages 814–822.

Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.

Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pages 2483–2493.

It is important to bear in mind the transformation the data follows during the training procedure, as well as what we do with the data at each phase. To clarify this in our setting, we provide in Figure 6 two diagrams describing this procedure for continuous and discrete variables, following the notation of the main paper. As a summary, data is transformed and scaled, and the scaled natural parameters are learned during training. Whenever evaluation is needed, these parameters are always returned to the space of the original data, that is,  �ηis transformed to  ηbefore evaluating on the space of x.


Figure 6: Schematic working flow used in this work. For training, data is transformed and their natural parameters are learned. To evaluate, the original parameters are recovered from the transformed ones.

To avoid confusion, let us clarify here what are the transformations described in Figure 6b (the continuous case is included as a special case). The step  xd �→ xdrefers to all the transformations regarding discrete data explained in Section 4.1 of the main paper. Specifically, splitting a categorical variable into K independent Bernoulli ones in the case of the Bernoulli trick, and the addition of noise in the case of the Gamma trick. The transformation  xd �→ �xdrefers to the data scaling procedure: standardization, normalization, Lipschitz standardization, etc. The orange arrow is the process performed by the model, which takes the input  �xdand outputs the parameters  �ηd. Then, in  �ηd �→ ηd, the parameters are scaled back to their original size, using the relationship between natural parameters described in Proposition 3.1 of the main paper. We do the transformation  ηd �→ ηdas described in Section 4.1 of the main paper, that is, removing noise, clipping, and gathering the K independent parameters into a dependent one as necessary. Finally, we can use those parameters  ηdto evaluate the data coming from the same source as the original data.

Something we have not discussed in the main paper regards the choice of the Gamma distribution as a proxy to learn the parameters of the Bernoulli and Poisson distributions. As counter-intuitive as it might seem at first, it turns out that the Gamma distribution is a great distribution for doing mean matching with respect to these distributions. To check this statement, we have run a simple Python code using scipy.stats that: i) generates random samples from a Bernoulli (Poisson) distribution; ii) adds additive noise from a distribution Beta(1.1, 30); iii) fits the data to a Gamma distribution and performs mean matching as explained before; and iv) computes the mean absolute difference between the estimated and real parameters. This procedure was performed for Bernoulli distributions with parameter p = i/50, and Poisson distributions with parameter  λ = iand  λ = i/50for i = 0, 1, . . . , 50. The average error obtained was 0.0081 and 0.0712 for the Bernoulli and Poisson distributions, respectively.

A.1 Illustrative example of data workflow

We provide a simple example that shows how data is transformed and used throughout the entire process. Assume that we have two input dimensions, D = 2, whose distributions are assumed to be normal  X1 ∼ N(µ, σ)and categorical with 3 classes  X2 ∼ Cat(π = (π1, π2, π3)), respectively. Let us further suppose that we want to use lip-gamma, that is, Lipschitz-standardization combined with the Gamma trick. Then, we would not alter the first variable X1 = X1 ∼ N(µ, σ), but substitute  X2with  X2j = X2j + εj ∼ Γ(αj, βj), where j = 1, 2, 3 are the indexes of the new variables,  X2j ∼ Bern(pj)refers to the j-th element of  X2when considered its one-hot representation, and εj ∼ Beta(1.1, 30)is the (independent) additive noise variable.

Now, we can scale transform all variables, thus obtaining the new scaled variables �X1 = ω1X1 ∼ N(�µ, �σ)and �X2j =ω2jX2j ∼ Γ(�α, �β)for j = 1, 2, 3. After training—or whenever we need to evaluate the model in non-training data— we ought to return to the original probabilistic model  X1, X2. When recovering the X variables, we need to use Proposition 3.1 so that  ηi = fi(ω) ⊙ �ηi, where we have obtained  �ηias the output of our model.

To finally recover the original variables,  X1, X2, we do not need to do anything to  X1since  X1 = X1. For the second variable, we obtain  X2j ∼ Bern(pj)as



Proposition B.1. If a real-valued function  ℓ(η)is  Li-smooth with respect to  ηi, the i-th parameter of  η ∈ RI, for all i = 1, 2, . . . , I, then  ℓis �i Li-smooth with respect to  η(assuming the 1-norm).

Proof. Consider two arbitrary  a, b ∈ RI. Then, by assumption,  |∂ηiℓ(a)−∂ηiℓ(b)| ≤ Li||a − b||for i = 1, 2, . . . , I and


Proposition B.2. If two real-valued functions  ℓ1(η)and  ℓ2(η)are  L1-smooth and  L2-smooth with respect to  η, respectively, then  ℓ1 + ℓ2is  L1 + L2-smooth with respect to  η.

Proof. Consider two arbitrary  a, b ∈ RI. Then,


As stated in the main paper, the exponential family is characterized for having the form


where  ηndare the natural parameters, T(x) the sufficient statistics, h(x) is the base measure, and  A(η)the log-partition function.

To ease the task of transforming between natural (η) and usual (θ) parameters, we provide in Table 2 a cheat-sheet with the relationship between them for the distributions used along the paper, as well as the way that natural parameters are scaled with respect to the scaling factor  ω.

Regarding the relation between scaled and original data in the exponential family, we now prove a more general version of Proposition 3.1 from the main text.

Proposition C.1. Let  p(x; η)be a density function of the exponential family where  x ∈ X ⊂ Rand  η ∈ Q ⊂ RI. Assume a bijective scaling function  �x : X × R+ → Xsuch that for any  ω ∈ R+it defines the function (and random variable)  �xω = �x(x, ω). If all sufficient statistics factorize as  Ti(�xω) = fi(ω)Ti(x) + gi(ω), then by defining  �ηsuch that  η = f(ω) ⊙ �η, where  f = (f1, f2, . . . , fI)and  ⊙is the element-wise multiplication, we have


where  ∂j�ηidenotes the jth-partial derivative with respect to  �ηi.

Proof. First we are going to relate the normalization constants  A(�η)and  A(η)of  log p(�xω; �η)and  log p(x; η), respectively:


Table 2: Relationship between parameters  θand natural parameters  η, as well as the way the latter scale (see Proposition 3.1 of the main text) for different distributions of the exponential family.


We can safely divide by h(x) since it is the Radon-Nikodym derivative dH(x)dxand we can assume that is non-zero almost everywhere in the domain of the likelihood.


By denoting  ϕ(x, ω)everything that is not  p(x; η)in the previous equation we have that:


Now, for the case j = 1 we just have to use the chain rule and the fact that  ϕ(x, ω)does not depend on  ηi:


And we can just prove the case j > 1 by induction:


In this section we show some results on how to find the optimal scaling factor  ωdsolving the problem described in Equation 14 of the main paper. For completeness, let us recall the problem:


where �Ldis the Lipschitz constant corresponding to the L-smoothness of the scaled d-th dimension, and  L∗ > 0is the smoothness goal that we attempt to achieve (as described in the main text).

For common distributions we are able to give some guarantees. Specifically, we can obtain closed-form solutions for the exponential and Gamma distributions, whereas for the (log-)normal distribution we prove the existence and uniqueness of the optimal  ωd.

Remark We use throughout the proofs the well-known result that  ∂ηiA(η) = E [Ti(x)]for any i = 1, 2, . . . , I in the case of the exponential family. Therefore,  Li = �j ∂ηj∂ηi log p(x; η)can be rewritten as Li = �j ∂ηj Eη [Ti(x)] = �j ∂ηi Eη [Tj(x)], where the last equality is a direct consequence of Young’s theorem.

Proposition D.1 (Exponential distribution). Let  X ∼ Exp(λ)and  X = {xn}Nn=1. Suppose that, for some value  �η, it holds that  log p(X; �η)is  Li-smooth w.r.t.  ηi ∈ ηfor i = 1. Then the solution for problem 23 always exists, is unique, and can be written as


Proof. The minimum of problem 23 happens when �Ii=1 �Li = L∗. In this particular case, when  �L1 = L∗. As show in Equation 10 from the main paper, we know that �Li(ω) = |fi(ω)| �j|fj(ω)|Lifor the 1-norm. In our particular


To show that  ω∗always exists we only have to show that  L1 > 0in all cases, which can easily shown:


and  L1 = |∂2η1| = η−21 > 0since  η1 > 0by definition.


Proposition D.2 (Gamma distribution). Let  X ∼ Γ(α, β)and  X = {xn}Nn=1. Suppose that, for some value  �η, it holds that  log p(X; �η)is  Li-smooth w.r.t.  ηi ∈ ηfor i = 1, 2. Then the solution for problem 23 exists if  L∗ > L1, is unique, and can be written as


Proof. As in the exponential case, we want to solve the equation �L1(ω) + �L2(ω) = L∗.


Therefore we need to find the roots of the polynomial  L2ω2 + (L1 + L2)ω + L1 − L∗ = 0. To find the roots, let us denote the discriminant as  ∆ = (L1 + L2)2 − 4L2(L1 − L∗). Note that we can simplify  ∆:


The roots  ωare given by


If  L∗ > L1we can again show that the solution always exists by computing  L2:


Proposition D.3 (Normal distribution). Let  X ∼ N(µ, σ2)and  X = {xn}Nn=1. Suppose that, for some value  �η, it holds that  log p(X; �η)is  Li-smooth w.r.t.  ηi ∈ ηfor i = 1, 2. Then the solution for problem 23 always exists, is unique, and can be expressed as the unique positive root of


Proof. First, note that  L2is always positive. To show that we calculate it approximation once again:


We have that  L2 > 0since the second term is only zero when  µ = 0and, if that is the case,  η1 = 0and the first term is positive.


This is equivalent to finding the positive roots of  Q(ω) = L2ω4 + (L1 + L2)ω3 + L1ω2 − L∗. Then let us call P(ω) = L2ω4 + L1ω2so that  Q(ω) = P(ω) + (L1 + L2)ω3 − L∗.

Note that there exists a unique positive solution of the equation  P(ω) = Giwith  Gi > 0. In fact, the only positive root of  L2ω4 + L1ω2 − Giis


Define  G0 = L∗. As just pointed out, there exists a unique  ω1 > 0such that  P(ω1) = G0. Then


since  G1 < G0, the discriminant of Equation 27 is smaller in the case of  G1and thus  ω2 < ω1.


We can now find ω2< ω3< ω1such that P(ω3) = G2, Q(ω3) = (L1 + L2)(ω33 − ω32). Note that ω31> ω33 ⇒ω31 + ω32> ω33 ⇒ ω31> ω33 − ω32, meaning that Q(ω3)< Q(ω1).

Thus far, we have built a sequence such that  Q(ω2) < 0 < Q(ω3) < Q(ω1). If we follow the process and define G3 = G2 − (L1 + L2)(ω33 − ω32)we will find an  ω2 < ω4 < ω3such that  Q(ω2) < Q(ω4) < 0 < Q(ω3) < Q(ω1).

Finally, let us define the sequence of intervals  Ii = [Q(ωi+1), Q(ωi)]for  i = 1, 2, . . . , ∞constructed using the described procedure. This sequence is a strictly decreasing nested sequence of non-empty compact subsets of R. Therefore, Cantor’s intersection theorem states that the intersection of these intervals is non-empty,  ∩iIi ̸= ∅, and since the only element which is in all the intervals is  0, ∩iIi = {0}.

The sequence  {Q(ω2i)}∞i=1({Q(ω2i+1)}∞i=1) converges to 0 since it is a strictly decreasing (increasing) sequence lower-bounded (upper-bounded) by 0. The sequences of their anti-images,  {ω2i}∞i=1and  {ω2i+1}∞i=1, converge then to the same value,  ω∗, the root of Q and the solution of problem 23.


E.1 L-smoothness after standardization

Similar to what we have done in Appendix D, here we are going to compute the estimator of the local L-smoothness for some usual distributions using Ł  = �iŁi = �i�j|∂ηj∂ηi log p(x; η)|, and then see how this smoothness changes as we scale by  ω = 1/std. We will use here the standard deviation expression of each particular likelihood, therefore these results hold as long as the selected likelihood properly fits the data.

(Log-)Normal distribution First, we compute the partial derivatives of the log-likelihood:


Therefore, we have that  L1 ≈ σ2 + 2|µ|σ2and  L2 ≈ 2σ2(|µ| + σ2 + 2µ2). After standardizing the data, we have that �µ = µ/σand  �σ2 = 1, resulting in �Lstd1 = 1 + 2 |µ|σand  �Lstd2 = 4| µσ|2 + 2 |µ|σ + 2.


So that  L1 ≈ |1 + (1 − α)ψ(1)(α)| + 1/βand  L2 ≈ V ar [x] + 1/β. After standardizing  �α = α, �β = √αand V ar [x] = 1, therefore �Lstd1is a function of  ψ(1)(α)and �Lstd2 = 1 + 1/√α.

Exponential distribution If  X ∼ Exp(λ)then  X ∼ Γ(1, 1/λ), so we can use the previous results so that  L1 ≈V ar [x] and �Lstd1 = 1.

Rayleigh distribution This distribution has parameter  σ > 0, sufficient statistic  T1(x) = x2/2, and natural parameter  η1 = −1/σ2.

We start by computing  ∂η1A(η) = E [T1(x)] = 12 E�x2�. Using that, for this distribution,  E�xj�= σj2j/2Γ(1 + j2):


Therefore,  L1 ≈ µ3/λ + µ/λand  L2 ≈ µ/λ + (2µ + λ)/(µλ2). After standardizing we have that  V ar [�x] = µ3/λ =1 ⇒ λ = µ3, thus �Lstd1 = 1 + 1/µ2and �Lstd2 = (2 + µ2 + µ4)/µ6.

Inverse Gamma distribution This distribution has parameters  α, β > 0, sufficient statistics  T1(x) = log x, T2(x) =1/x, and natural parameters  η1 = −α − 1, η2 = −β.


The interesting bit about these last two estimators is that both explode as they get closer to 2, and both vanish as they get further from it, as it can be readily checked by plotting them.

E.2 Scale-invariant smoothness of the Gamma distribution

In section 4.1 it was introduced the concept of Gamma trick, which acts as a approximation for discrete distributions. Moreover, the discrete variables were assumed to take place in the natural numbers. The reason is that it is beneficial for this approximation that the original variable x is somewhat far from zero.

This statement it is justified by the following: the second derivative of a Gamma log-likelihood with respect to the first natural parameter,  ∂2η1 log p(x; η), rapidly decreases as the data moves away from zero.

As computed before in Equation 28, one part of  L1is scale-invariant and has the form  1 + (1 − α)ψ(1)(α). Figure 7 shows a plot of this formula as a function of  α. It is easy to observe that as the shape parameter grows the value of (our approximation to)  L1drastically decreases.


Figure 7: Plot of  L1for the Gamma distribution.

Finally, by supposing that discrete data are natural numbers, the mode is at least one, which in practice means that the value for  αis bigger than 1 (usually close to 10), thus ensuring that the value of (our approximation to)  L1mostly depends on the scale-dependent parameter  β.

F.1 Missing imputation models

Here we give a deeper description of the models used on the experiments. All of them have the form described in the problem statement (Section 2), following the graphical model depicted in Figure 8.


Figure 8: Latent variable model describing the joint distribution of Section 2.

Mixture model Following the form of the join distribution from Section 2, the mixture model is fully described by:


In order to implement the discrete latent parameters such that they can be trained via automatic differentiation, the latent categorical distribution is implemented using a GumbelSoftmax distribution (Jang et al., 2016) with a temperature that updates every 20 epochs as:


Matrix factorization Similar to the mixture model, the matrix factorization model follows the same graphical model and it is (almost) fully described by:


General notes:

• We assume normal latent variables with a standard normal as prior.

• Hidden layers have 256 neurons.

• The latent size is set to the 75 % of the data number of dimensions (before preprocessing).

• Layers are initialized using a Xavier uniform policy.

Specifics about the encoder:

• As we have to avoid using the missing data (since it is going to be our test set), we implement an input-dropout layer as in Nazabal et al. (2018).

• In order to guarantee a common input (and thus, a common well-behaved neural net) across all data scaling methods, we put a batch-normalization layer at the beginning of the encoder. Note that this does not interfere with the goal of this work, which is about the evaluation of the loss function.

• In order to obtain the distributional parameters of  zn, µnand  σn, we pass the result of the encoder through two linear layers, one for the mean and another for the log-scale. The latter is transformed to the scale via a softplus function.

Specifics about the decoder:


When it comes to evaluation we use missing imputation error, that is, for the imputed missing values that are numerical we compute the normalized root mean squared error (NRMSE),


where  ˆxis the value inferred by the model, and in the case of nominal data we compute the error rate, i.e.,


The final metric is the mean across dimensions,  err = 1D�d err(d).


Figure 9: Missing imputation error across different datasets and missing-values percentages. Lower is better.


Figure 10: Per-dimension normalized missing imputation error on the defaultCredit dataset (lower is better).


Figure 11: Per-dimension normalized missing imputation error on the letter dataset (lower is better).

In this section we show complementary results from the experiments performed in the main paper. First, Figure 9 depicts the same data as Figure 3 of the main paper, but averaging across models instead of missing-values percentages. Second, we plot in Figures 10 and 11 per-dimension barplots of the normalized missing imputation error as in Figure 5, now for the defaultCredit and letter datasets, respectively. These figures further validate the argument of lip-gamma not overlooking any variable, unlike lip-bern and std-gamma. Finally, we present the results in tabular form, divided by type of variable (discrete vs. continuous) and type of model (mixture model, matrix factorization and VAE). Tables 4, 5, and 6 show the results obtained with a 10 %, 20 %, and 50 % of missing values, respectively. Major differences have been colored to ease their reading.

As discussed in Section 5, applying Lipschitz standardization results in an improvement on the imputation error across all datasets, being in the worst case as good as the best of the other methods. We can also observe how this improvement mainly manifests on discrete random variables when the Bernoulli and Gamma tricks are applied, and that the effect of data scaling is less noticeable as the expressiveness of the model increases. There are cases, like in the Adult dataset, where there is a trade-off on learning the discrete dimensions and worsening the results on continuous dimensions. However, the case where properly learning the discrete distributions translates to an improvement on all dimensions can also occur, as in the defaultCredit dataset.

Finally, there is an important aspect that qualitatively differentiates lip-gamma from lip-bern and std-gamma. The consequence of Lipschitz standardizing every dimension is obtaining the more balanced learning that we aim for, and in cases with high heterogeneity, such as defaultCredit and Adult, the stability and robustness of the algorithm increases. A clear example of this can be seen by checking the evolution of the defaultCredit dataset on Tables 4, 5, and 6. It is worth-noting that lip-gamma keeps achieving consistent results even under a half missing-data regime, which is impressive.

Table 4: Missing imputation error with a 10 % of missing data.

Discrete Continuous Imputation error Mixture Matrix fact. VAE Mixture Matrix fact. VAE


Table 5: Missing imputation error with a 20 % of missing data.

Discrete Continuous Imputation error Mixture Matrix fact. VAE Mixture Matrix fact. VAE


Table 6: Missing imputation error with a 50 % of missing data.

Discrete Continuous Imputation error Mixture Matrix fact. VAE Mixture Matrix fact. VAE


Designed for Accessibility and to further Open Science