1b:["$","$L29",null,{"isWhiteLabelled":false,"children":["$","$Lb",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L2a",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMjAwMi4xMTM2OSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2020-10-21T21:24:54.000Z","paperID":"2002.11369","published":"2020-02-26T09:20:10.000Z","authors":"[\"Adrián Javaloy\",\"Isabel Valera\"]","title":"Lipschitz standardization for robust multivariate learning","scoreTrending":null,"summary":"$2b","lastCheckedForCode":"2022-09-04T09:17:48.995Z","links":[{"id":"eyJ1cmwiOiJodHRwczovL3BhcGVyc3dpdGhjb2RlLmNvbS9wYXBlci9saXBzY2hpdHotc3RhbmRhcmRpemF0aW9uLWZvci1yb2J1c3QifQ==","type":"pwc","url":"https://paperswithcode.com/paper/lipschitz-standardization-for-robust","data":null}],"reposConnection":{"edges":[]},"models":[],"tags":[],"summaries":[],"emailsConnection":{"edges":[{"author":"adrian javaloy","node":{"id":"eyJhZGRyZXNzIjoiYWphdmFsb3lAdHVlLm1wZy5kZSJ9","address":"ajavaloy@tue.mpg.de","name":null,"avatar":null,"linkedin":null,"bio":null,"site":null,"override":null,"membership":[],"paper":[{"modelsAggregate":{"count":0}}],"github":[],"scholar":[],"twitter":[],"location":[],"owner":[{"id":"eyJ1aWQiOiIwMzU1MDhiMy1iMWIwLTRhOGUtYTQyMS03YThhMDVlZTRjNDUifQ==","name":"adrian javaloy","github":[],"email":[],"authored":[{"id":"eyJwYXBlcklEIjoiMjAwMi4xMTM2OSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","paperID":"2002.11369"}]}]}}]},"__typename":"paper","authorArray":["Adrián Javaloy","Isabel 
Valera"]}}],["$","$L18",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L18",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$7",null,{"children":["$","$L2c",null,{"publisher":"arxiv","paperID":"2002.11369","product":{"paper":"$1b:props:children:props:children:0:props:product","models":"$1b:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$7",null,{"children":["$","$L2d",null,{"article":"$L2e","model":"$undefined"}]}]]}],["$","$L18",null,{"size":"grow","children":["$","$L2f",null,{}]}]]}],["$","$7",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L30",null,{"paperID":"2002.11369","publisher":"arxiv","paperJSON":{"title":"Lipschitz standardization for robust multivariate learning","paperID":"2002.11369","avgLineHeight":10.91,"imgScale":4,"sections":[{"heading":"ABSTRACT","paragraphs":[[{"text":"Probabilistic learning is increasingly being tackled as an optimization problem, with gradient-based approaches as predominant methods. When modelling multivariate likelihoods, a common but undesirable outcome is that the learned model fits only a subset of the observed variables, overlooking the rest. In this work, we study this problem through the lens of multitask learning (MTL), where similar effects have been broadly studied. While MTL solutions do not directly apply in the probabilistic setting, as they cannot handle the likelihood constraints, we show that similar ideas may be leveraged during data preprocessing. First, we show that data standardization often helps under common continuous likelihoods, but it is not enough in the general case, especially under mixed continuous and discrete likelihood models. 
To achieve ","element":"span"},{"style":{"fontStyle":"italic"},"text":"balanced multivariate learning","element":"span"},{"text":", we then propose a novel data preprocessing method, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization","element":"span"},{"text":", which balances the local Lipschitz smoothness across variables. Our experiments on real-world datasets show that Lipschitz standardization leads to more accurate multivariate models than the ones learned using existing data preprocessing techniques. The models and datasets employed in the experiments can be found in ","element":"span"},{"href":"https://github.com/adrianjav/lipschitz-standardization","text":"https://github.com/adrianjav/lipschitz-standardization.","element":"a"}]]},{"heading":"1 Introduction","paragraphs":[[{"id":"id-3","style":{"width":"49%"},"width":928,"height":337,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/0-0.png","element":"img"}],[{"text":"Figure 1: Marginals of continuous (left) and discrete (right) variables from the Adult dataset obtained from a trained VAE. Top to bottom: ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"ground-truth","element":"figcaption","subtype":"caption"},{"text":", actual data; ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"std","element":"figcaption","subtype":"caption"},{"text":", continuous variables were standardized; ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"std-all","element":"figcaption","subtype":"caption"},{"text":", everything was standardized after replacing the discrete distributions by continuous approximations.","element":"figcaption","subtype":"caption"}],[{"text":"In the past few years, gradient-based optimization approaches have become the gold standard for probabilistic learning. 
Representative examples of this trend include black box variational inference (BBVI) ","element":"span"},{"href":"#id-0","referenceIndex":18,"text":"(Ranganath ","element":"a"},{"href":"#id-0","referenceIndex":18,"text":"et al., ","element":"a"},{"href":"#id-0","referenceIndex":18,"text":"2014) ","element":"a"},{"text":"and Variational Autoencoders (VAE) ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"(Diederik et al., ","element":"a"},{"href":"#id-1","referenceIndex":4,"text":"2014)","element":"a"},{"text":". ","element":"span"},{"text":"However, when such methods are applied to real-world datasets, one often encounters issues such as numerical instabilities.","element":"span"}],[{"text":"As an illustrative example, we learn a VAE on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult ","element":"span"},{"style":{"fontStyle":"italic"},"text":"dataset ","element":"span"},{"text":"from the UCI repository ","element":"span"},{"href":"#id-2","referenceIndex":5,"text":"(Dua and Graff, ","element":"a"},{"href":"#id-2","referenceIndex":5,"text":"2017)","element":"a"},{"text":", ","element":"span"},{"text":"where every observation is represented by a set of twelve ","element":"span"},{"text":"mixed continuous and discrete variables, with heterogeneous ","element":"span"},{"text":"data distributions (see Figure ","element":"span"},{"href":"#id-3","text":"1)","element":"a"},{"text":". As is common practice, we prevent numerical issues by standardizing the continuous variables prior to training the model. 
However, as shown in Figure ","element":"span"},{"href":"#id-3","text":"1, ","element":"a"},{"text":"while the learned model does a reasonable job at fitting the continuous variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Final weight","element":"span"},{"text":", it results in a poor fit of the discrete variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Occupation","element":"span"},{"text":". Since discrete data seem cumbersome to work with, we then rely on a continuous approximation of these variables ","element":"span"},{"style":{"fontStyle":"italic"},"text":"and standardize every variable ","element":"span"},{"text":"to learn the VAE. Once the VAE is learned, we use the learned parameters to recover the parameters of the discrete likelihoods. In this case, illustrated in the bottom row of Figure ","element":"span"},{"href":"#id-3","text":"1, ","element":"a"},{"text":"the VAE does a better job at capturing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Occupation ","element":"span"},{"text":"but at the price of a poor fit of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Final weight","element":"span"},{"text":".","element":"span"}],[{"text":"In order to understand the source of this issue, we need to dive deeper into the problem formulation. In short, the objective function of the VAE can be written as the sum of per-variable losses, i.e., ","element":"span"},{"style":{"height":16.74},"width":192.39,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/0-1.png","element":"img","alt":" L = Σ_d L_d","inline":true},{"text":", and thus be interpreted ","element":"span"},{"text":"as a multitask learning (MTL) problem, where different tasks (variables, in our case) compete for the model parameters during learning. 
In this context, previous work has shown that disparities in the gradient magnitudes across tasks may determine which tasks the model prioritizes during training ","element":"span"},{"href":"#id-4","referenceIndex":19,"text":"(Ruder, ","element":"a"},{"href":"#id-4","referenceIndex":19,"text":"2017)","element":"a"},{"text":". Due to the more restrictive nature of probabilistic learning, however, extant solutions from the MTL literature, e.g., GradNorm ","element":"span"},{"href":"#id-5","referenceIndex":3,"text":"(Chen et al., ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"2018)","element":"a"},{"text":", do not directly apply, as the likelihood would no longer integrate to one.","element":"span"}],[{"text":"Figure 2: Same setting as in Figure ","element":"figcaption","subtype":"caption"},{"href":"#id-3","text":"1 ","element":"a","subtype":"caption"},{"text":"where now ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"ground-truth ","element":"figcaption","subtype":"caption"},{"text":"is the actual data and ","element":"figcaption","subtype":"caption"},{"style":{"fontWeight":"bold"},"text":"lip-all ","element":"figcaption","subtype":"caption"},{"text":"refers to the variable fits obtained after preprocessing with ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization","element":"figcaption","subtype":"caption"},{"text":", which fits every variable well. In this paper, we rely on BBVI as a showcase of gradient-based probabilistic learning to show that the solution resides in the data itself. 
Specifically, in Section ","element":"figcaption","subtype":"caption"},{"id":"id-7","href":"#id-6","text":"2.2, ","element":"a","subtype":"caption"},{"text":"we first formalize the concept of ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"balanced multivariate learning","element":"figcaption","subtype":"caption"},{"text":", which aims to ensure that all the observed variables are learned at the same rate, so that no variable is overlooked. In this context, we are able to study why data standardization often helps towards balanced learning when applied to common continuous likelihood","element":"figcaption","subtype":"caption"}],[{"text":"functions, such as the Gaussian distribution (Section ","element":"span"},{"text":"3)","element":"span"},{"text":". Unfortunately, as shown in our example above, this is not always the case. Then, based on our analysis, we propose ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization ","element":"span"},{"text":"(Section ","element":"span"},{"text":"4)","element":"span"},{"text":", a novel preprocessing method that reshapes the data to equalize the local Lipschitz smoothness of the log-likelihood functions across all continuous and discrete variables. As illustrated in Figure ","element":"span"},{"href":"#id-7","text":"2, ","element":"a"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization ","element":"span"},{"text":"facilitates a more accurate fit by balancing learning across all variables.","element":"span"}],[{"text":"Finally, we test Lipschitz standardization prior to learning different probabilistic models (mixture models, probabilistic matrix factorization, and VAEs) on six real-world datasets (see Section ","element":"span"},{"text":"5)","element":"span"},{"text":". 
Our results show that the proposed method leads to more balanced learning across dimensions and greatly improves the final performance in most settings; in the worst case, it is as good as the best of the considered baseline preprocessing methods, including data standardization.","element":"span"}]]},{"heading":"2 Problem Statement","paragraphs":[[{"text":"Let us assume a set of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"N ","element":"span"},{"text":"observations ","element":"span"},{"style":{"height":17.38},"width":241.95,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-0.png","element":"img","alt":" X = {x_n}_{n=1}^N","inline":true},{"text":", each with ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"different features ","element":"span"},{"style":{"height":17.9},"width":259.05,"height":44.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-1.png","element":"img","alt":" x_n = {x_{nd}}_{d=1}^D","inline":true},{"text":". ","element":"span"},{"text":"Following ","element":"span"},{"href":"#id-8","referenceIndex":9,"text":"Hoffman et al. 
","element":"a"},{"href":"#id-8","referenceIndex":9,"text":"(2013)","element":"a"},{"text":", we consider that the joint distribution over the observed variables ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"X","element":"span"},{"text":", local latent variables ","element":"span"},{"style":{"height":17.38},"width":232.21,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-2.png","element":"img","alt":" Z = {z_n}_{n=1}^N","inline":true},{"text":", and global latent variables ","element":"span"},{"style":{"height":14},"width":27,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-3.png","element":"img","alt":" β","inline":true},{"text":" is given by the fairly simple, yet general, latent variable model ","element":"span"},{"style":{"height":20.4},"width":740.4,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-4.png","element":"img","alt":"p(X, Z, β) = p(β) ∏_{n=1}^N p(x_n|z_n, β) p(z_n)","inline":true},{"text":". 
To account for mixed likelihood models, we further assume that the ","element":"span"},{"text":"likelihood factorizes per dimension as","element":"span"}],[{"id":"id-13","style":{"width":"64%"},"width":1206,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-5.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.1},"width":62.08,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-6.png","element":"img","alt":" ηnd","inline":true,"padRight":true},{"text":"denotes the likelihood parameters given by the latent variables ","element":"span"},{"style":{"height":9.59},"width":43.8,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-7.png","element":"img","alt":" zn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":14},"width":27,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-8.png","element":"img","alt":" β","inline":true,"padRight":true},{"text":"for each ","element":"span"},{"style":{"height":16},"width":267,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-9.png","element":"img","alt":" xnd, ηnd(zn, β)","inline":true},{"text":".","element":"span"}],[{"text":"Furthermore, we rely on BBVI ","element":"span"},{"href":"#id-0","referenceIndex":18,"text":"(Ranganath et al., ","element":"a"},{"href":"#id-0","referenceIndex":18,"text":"2014) ","element":"a"},{"text":"to approximate the posterior distribution over the latent variables, ","element":"span"},{"style":{"height":16},"width":182.39,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-10.png","element":"img","alt":" p(Z, β|X)","inline":true},{"text":". 
For simplicity in exposition, we assume a mean-field variational distribution family of the form ","element":"span"},{"style":{"height":22.57},"width":551.12,"height":56.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-11.png","element":"img","alt":" q(Z, β) = q_{γ_β}(β) ∏_{n=1}^N q_{γ_n}(z_n)","inline":true},{"text":", where ","element":"span"},{"style":{"height":17.38},"width":147.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-12.png","element":"img","alt":" {γ_n}_{n=1}^N","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.5},"width":47.06,"height":33.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-13.png","element":"img","alt":" γ_β","inline":true,"padRight":true},{"text":"are respectively the local and global variational ","element":"span"},{"text":"parameters. We denote by ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-14.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"the set of all variational parameters. 
BBVI relies on (stochastic) gradient ascent to find the parameters that maximize the evidence lower bound (ELBO),","element":"span"},{"text":"1 ","element":"span"},{"text":"i.e.,","element":"span"}],[{"id":"id-9","style":{"width":"80%"},"width":1508,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-15.png","element":"img"}],[{"text":"BBVI performs iterative updates over the variational (global and local) parameters of the form ","element":"span"},{"style":{"height":18.18},"width":495.32,"height":45.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-16.png","element":"img","alt":"γ_t = γ_{t−1} + α∇_γ L(X, γ, ϕ)","inline":true,"padRight":true},{"text":"where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"is the current step in the optimization procedure. 
","element":"span"},{"text":"We further assume that the reparametrization trick ","element":"span"},{"href":"#id-1","referenceIndex":4,"text":"(Diederik et al., ","element":"a"},{"href":"#id-1","referenceIndex":4,"text":"2014) ","element":"a"},{"text":"can be applied to the latent variables (i.e., ","element":"span"},{"style":{"height":16},"width":261.88,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-17.png","element":"img","alt":" Z, β = f(γ, ε)","inline":true},{"text":", where ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-18.png","element":"img","alt":" ε","inline":true,"padRight":true},{"text":"is a noise variable), such that the gradient of Eq. ","element":"span"},{"href":"#id-9","text":"2 ","element":"a"},{"text":"can be computed as:","element":"span"}],[{"id":"id-10","style":{"width":"79%"},"width":1497,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-19.png","element":"img"}],[{"text":"where we denote the log-likelihood ","element":"span"},{"style":{"height":16},"width":291.83,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-20.png","element":"img","alt":" log p_d(x_d; η_d(γ))","inline":true,"padRight":true},{"text":"by ","element":"span"},{"style":{"height":16},"width":167.71,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-21.png","element":"img","alt":" ℓ_d(η_d(γ))","inline":true},{"text":", making explicit the dependency of the log-likelihood evaluation on the variational parameters ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-22.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"through the likelihood parameters 
","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-23.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"while leaving implicit its dependency on ","element":"span"},{"style":{"height":9.59},"width":43.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-24.png","element":"img","alt":" x_d","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/1-25.png","element":"img","alt":" ε","inline":true},{"text":".","element":"span"}],[{"text":"A closer look at Eq. ","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"shows that each dimension in the data contributes to the overall gradient computation in an additive way. Therefore, the gradient evaluation with respect to the shared parameters (and, in consequence, the learning process) can be monopolized by a small subset of dimensions if their gradients dominate the sum in Eq. ","element":"span"},{"href":"#id-10","text":"3. ","element":"a"},{"text":"In other words, while the objective is to capture the joint distribution of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"all dimensions","element":"span"},{"text":", differences in the gradient evaluation across different observed variables (e.g., Gaussian vs. multinomial) may result in a latent variable model that poorly fits a subset of the observed dimensions, as we already observed in the example of Section ","element":"span"},{"text":"1.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Connections with multitask learning","element":"span"}],[{"text":"The gradient computation in Eq. 
","element":"span"},{"href":"#id-10","text":"3 ","element":"a"},{"text":"(and the undesirable scenario described above) may seem familiar to readers knowledgeable about the MTL literature. In MTL, it is common to have a set of shared parameters ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-0.png","element":"img","alt":" γ","inline":true,"padRight":true},{"text":"whose gradients are of the form ","element":"span"},{"style":{"height":16.75},"width":303.95,"height":41.86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-1.png","element":"img","alt":" ∇_γ L = Σ_d ∇_γ L_d","inline":true},{"text":", where the sum is taken over all the tasks and each ","element":"span"},{"style":{"height":13.19},"width":44.48,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-2.png","element":"img","alt":" L_d","inline":true,"padRight":true},{"text":"is the loss function ","element":"span"},{"text":"of a particular task. When great disparities exist between task gradients during learning, the resulting model may perform poorly on some tasks, an effect attributed to the competition between tasks for the shared parameters and known as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"negative transfer ","element":"span"},{"href":"#id-4","referenceIndex":19,"text":"(Ruder, ","element":"a"},{"href":"#id-4","referenceIndex":19,"text":"2017)","element":"a"},{"text":". Hence, the (variational) inference problem stated in Eq. 
","element":"span"},{"href":"#id-9","text":"2 ","element":"a"},{"text":"may also be interpreted as a (more restrictive) MTL problem where the input variables play the role of tasks, and the inference parameters are shared.","element":"span"}],[{"text":"Given a set of fixed tasks, the most common approach in MTL is to tackle the previous problem using adaptive solutions ","element":"span"},{"href":"#id-5","referenceIndex":3,"text":"(Chen et al., ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"2018; ","element":"a"},{"href":"#id-11","referenceIndex":13,"text":"Kendall et al., ","element":"a"},{"href":"#id-11","referenceIndex":13,"text":"2018; ","element":"a"},{"href":"#id-12","referenceIndex":7,"text":"Guo et al., ","element":"a"},{"href":"#id-12","referenceIndex":7,"text":"2018)","element":"a"},{"text":". These solutions add a set of weights to the loss function, ","element":"span"},{"style":{"height":16.74},"width":240.5,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-3.png","element":"img","alt":" L = Σ_d ω_d L_d","inline":true},{"text":", and dynamically change their value, based on different criteria, so that the magnitude of ","element":"span"},{"text":"each task gradient ","element":"span"},{"style":{"height":15.59},"width":100.26,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-4.png","element":"img","alt":" ∇_γ L_d","inline":true,"padRight":true},{"text":"is comparable to those of the other tasks.","element":"span"}],[{"text":"Unfortunately, this type of solution cannot be applied in the probabilistic setting since, as we mentioned before, we face a more restrictive problem. Specifically, by adding this set of weights in Eq. ","element":"span"},{"href":"#id-9","text":"2, ","element":"a"},{"text":"we would also modify the likelihood in Eq. 
","element":"span"},{"href":"#id-13","text":"1, ","element":"a"},{"text":"which would no longer integrate to one as required.","element":"span"}],[{"id":"id-6","style":{"fontWeight":"bold"},"text":"2.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Balanced multivariate learning","element":"span"}],[{"text":"In variational inference, or more generally, in approximate Bayesian inference, we aim to accurately capture the posterior distribution of the latent variables explaining the joint distribution over all the observed variables, and not just a subset of them. Ideally, we want to follow a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"balanced multivariate learning process","element":"span"},{"text":", where the normalized likelihood improvement per iteration ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"+ 1 ","element":"span"},{"text":"is the same for all dimensions, i.e.,","element":"span"}],[{"id":"id-15","style":{"width":"64%"},"width":1215,"height":98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-5.png","element":"img"}],[{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . 
, D","element":"span"},{"text":", where ","element":"span"},{"style":{"height":16.98},"width":42.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-6.png","element":"img","alt":" γ_0","inline":true,"padRight":true},{"text":"denotes the initialization of the variational parameters, and ","element":"span"},{"style":{"height":12.98},"width":43.33,"height":32.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-7.png","element":"img","alt":" C_t","inline":true,"padRight":true},{"text":"the constant improvement at step ","element":"span"},{"style":{"fontStyle":"italic"},"text":"t ","element":"span"},{"text":"for all dimensions.","element":"span"}],[{"text":"To the best of our knowledge, this is the first time that balanced learning is properly defined, although its relevance has been acknowledged in prior MTL work (e.g., Eq. (6) of ","element":"span"},{"href":"#id-14","referenceIndex":15,"text":"Milojkovic et al., ","element":"a"},{"href":"#id-14","referenceIndex":15,"text":"2019)","element":"a"},{"text":". Of special interest is GradNorm ","element":"span"},{"href":"#id-5","referenceIndex":3,"text":"(Chen ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"et al., ","element":"a"},{"href":"#id-5","referenceIndex":3,"text":"2018)","element":"a"},{"text":", an adaptive solution whose weights are tuned to “dynamically adjust gradient norms so different tasks train at similar rates”, including ","element":"span"},{"style":{"height":17.38},"width":427.1,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-8.png","element":"img","alt":" ℓ_d(η_d(γ_{t+1}))/ℓ_d(η_d(γ_0))","inline":true,"padRight":true},{"text":"in their formulation. Unfortunately, Eq. 
","element":"span"},{"href":"#id-15","text":"4 ","element":"a"},{"text":"turns out to be an unrealistic goal for the scope of this work.","element":"span"}],[{"text":"To find a more feasible objective, we focus on the class of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth functions, which is the broadest class of functions with convergence guarantees in gradient descent. A function ","element":"span"},{"style":{"height":16},"width":76.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-9.png","element":"img","alt":" ℓ(γ)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth on ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"with respect to ","element":"span"},{"style":{"height":14.4},"width":116.14,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-10.png","element":"img","alt":" γ ∈ Q","inline":true,"padRight":true},{"text":"if it is twice differentiable and, for any ","element":"span"},{"style":{"height":14},"width":144.39,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-11.png","element":"img","alt":" a, b ∈ Q","inline":true},{"text":", it holds that:","element":"span"}],[{"id":"id-27","style":{"width":"65%"},"width":1227,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-12.png","element":"img"}],[{"text":"For this class of functions, there exist theoretical results on the convergence rate to a critical point as a function of the Lipschitz constant ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L ","element":"span"},{"text":"and the number of steps ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"(Nesterov, 
","element":"a"},{"href":"#id-16","referenceIndex":17,"text":"2018)","element":"a"},{"text":". Using our notation, this rate can be written as ","element":"span"},{"style":{"height":19.22},"width":741.95,"height":48.05,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-13.png","element":"img","alt":"min_{t=1,2,...,T} ||∇_γ ℓ_d(η_d(γ_t))|| = O(√(L/T))","inline":true},{"text":". Note that this implies ","element":"span"},{"style":{"height":17.78},"width":368.87,"height":44.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-14.png","element":"img","alt":" ||∇_γ ℓ_d(η_d(γ_t))|| → 0","inline":true,"padRight":true},{"text":"as ","element":"span"},{"style":{"height":10.4},"width":119.98,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-15.png","element":"img","alt":" t → ∞","inline":true},{"text":", and, in turn, ","element":"span"},{"style":{"height":19.96},"width":700.2,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-16.png","element":"img","alt":"||∇_γ ℓ_d(η_d(γ_{t+1})) − ∇_γ ℓ_d(η_d(γ_t))|| → 0","inline":true},{"text":". We can thus replace Eq. ","element":"span"},{"href":"#id-15","text":"4 ","element":"a"},{"text":"by","element":"span"}],[{"id":"id-17","style":{"width":"69%"},"width":1301,"height":107,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-17.png","element":"img"}],[{"text":"which instead requires the difference between consecutive gradients to be proportionally equal across dimensions. Finally, assuming a good parameter initialization ","element":"span"},{"style":{"height":16.98},"width":42.06,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/2-18.png","element":"img","alt":" γ_0","inline":true,"padRight":true},{"text":"such that the initial gradient magnitudes are comparable across ","element":"span"},{"text":"dimensions, we can also treat the denominator of Eq. 
","element":"span"},{"href":"#id-17","text":"6 ","element":"a"},{"text":"as constant. As a result, forcing every dimension to be ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth, i.e.,","element":"span"}],[{"style":{"height":19.96},"width":930.51,"height":49.91,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-0.png","element":"img","alt":"||∇_γ ℓ_d(η_d(γ_{t+1})) − ∇_γ ℓ_d(η_d(γ_t))|| ≤ L ||γ_{t+1} − γ_t||","inline":true,"padRight":true},{"text":"(7), turns out to be a weaker version of Eq. ","element":"span"},{"href":"#id-17","text":"6, ","element":"a"},{"text":"whose goal is to facilitate a more balanced multivariate learning process.","element":"span"}],[{"text":"In the following section, we study the impact of data standardization on the learning process. To this end, we show the relationship between the Lipschitz constants of the likelihood functions evaluated on the original and the standardized data. We then propose an estimator of the (local) Lipschitz constant, which allows us to show that, while data standardization may help, unfortunately in some cases it may be counterproductive for balanced multivariate learning.","element":"span"}]]},{"heading":"3 The effect of standardization","paragraphs":[[{"text":"Preprocessing methods (e.g., standardization) are widely used in statistics and machine learning. However, there is a priori no way of deciding which one to use ","element":"span"},{"href":"#id-18","referenceIndex":6,"text":"(Gnanadesikan et al., ","element":"a"},{"href":"#id-18","referenceIndex":6,"text":"1995; ","element":"a"},{"href":"#id-19","referenceIndex":12,"text":"Juszczak et al., ","element":"a"},{"href":"#id-19","referenceIndex":12,"text":"2002; ","element":"a"},{"href":"#id-20","referenceIndex":14,"text":"Milligan and Cooper, ","element":"a"},{"href":"#id-20","referenceIndex":14,"text":"1988)","element":"a"},{"text":". In distance-based machine learning methods, e.g. 
clustering, the effectiveness of these two methods can be readily understood since they bring all the data into a similar range, making the distance between points comparable across dimensions ","element":"span"},{"href":"#id-21","referenceIndex":1,"text":"(Aksoy and Haralick, ","element":"a"},{"href":"#id-21","referenceIndex":1,"text":"2001)","element":"a"},{"text":". In other approaches, such as maximum likelihood or variational inference, the distance argument becomes less convincing,","element":"span"},{"text":"2 ","element":"span"},{"text":"since explicit distance between observations is no longer evaluated. Another argument is that they usually improve numerical stability by moving the data, and thus the model parameters, to a well-behaved part of the real space. Since computers struggle to work with tiny and large values, this would have an inherent effect on the learning process.","element":"span"}],[{"text":"In this section, we study the impact that dimension-wise data preprocessing, specifically scaling transformations of the form ","element":"span"},{"style":{"height":6.8},"width":130.27,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-1.png","element":"img","alt":" ˜x = ωx","inline":true},{"text":", has on BBVI as an example of Bayesian inference methods based on first-order optimization. We choose scaling transformations since: i) they preserve important properties of the data distribution, such as domain and tails; and ii) they are broadly used in practice ","element":"span"},{"href":"#id-22","referenceIndex":8,"text":"(Han et al., ","element":"a"},{"href":"#id-22","referenceIndex":8,"text":"2011)","element":"a"},{"text":". 
Note that as shifting the data, ","element":"span"},{"style":{"height":10},"width":177.87,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-2.png","element":"img","alt":" �x = x − µ","inline":true},{"text":", may violate distributional restrictions (e.g., non-negativity), we assume that the data may have been already shifted prior to the likelihood selection. Specifically, our main focus is on three broadly-used data scaling methods:","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Standardization: ","element":"span"},{"style":{"height":16},"width":278.11,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-3.png","element":"img","alt":" �xnd = xnd/stdd,","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Normalization: ","element":"span"},{"style":{"height":16},"width":298.92,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-4.png","element":"img","alt":" �xnd = xnd/maxd,","inline":true,"padRight":true},{"style":{"fontWeight":"bold"},"text":"Interquartile range: ","element":"span"},{"style":{"height":16},"width":272.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-5.png","element":"img","alt":" �xnd = xnd/iqrd,","inline":true}],[{"text":"where ","element":"span"},{"style":{"height":13.2},"width":186.2,"height":33,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-6.png","element":"img","alt":" stdd, maxd","inline":true},{"text":", and ","element":"span"},{"style":{"height":14.7},"width":64.71,"height":36.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-7.png","element":"img","alt":" iqrd","inline":true,"padRight":true},{"text":"denote the empirical standard deviation, absolute maximum, and interquartile range of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-th dimension, 
respectively.","element":"span"}],[{"text":"Next, we introduce a novel perspective on the effect of data scaling in inference methods based on first-order optimization. In a similar way as ","element":"span"},{"href":"#id-23","referenceIndex":20,"text":"Santurkar et al. ","element":"a"},{"href":"#id-23","referenceIndex":20,"text":"(2018) ","element":"a"},{"text":"showed that batch normalization ","element":"span"},{"href":"#id-24","referenceIndex":10,"text":"(Ioffe and Szegedy, ","element":"a"},{"href":"#id-24","referenceIndex":10,"text":"2015) ","element":"a"},{"text":"smooths out the optimization landscape of the loss function, we show that data standardization often smooths out the log-likelihood optimization landscape in a similar way across dimensions. Importantly, by applying the chain rule to the gradient computation, i.e., ","element":"span"},{"style":{"height":16.79},"width":479.97,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-8.png","element":"img","alt":" ∇γℓ(η(γ)) = ∇ηℓ(η) · ∇γη","inline":true},{"text":", we can focus on the data-dependent part, the likelihood gradient ","element":"span"},{"style":{"height":16.79},"width":130.97,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-9.png","element":"img","alt":" ∇ηℓ(η)","inline":true},{"text":".","element":"span"},{"text":"3 ","element":"span"},{"text":"In the following, we denote by ","element":"span"},{"style":{"height":16},"width":409.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-10.png","element":"img","alt":"�ℓd(�ηd) := log pd(�xd; �ηd)","inline":true,"padRight":true},{"text":"the likelihood function (with parameters ","element":"span"},{"style":{"height":11.1},"width":42.39,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-11.png","element":"img","alt":"�ηd","inline":true},{"text":") evaluated on the scaled 
data.","element":"span"}],[{"id":"id-30","style":{"fontWeight":"bold"},"text":"3.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Scaling the exponential family","element":"span"}],[{"style":{"width":"99%"},"width":1870,"height":130,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-12.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":186.02,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-13.png","element":"img","alt":" ηnd(zn, β)","inline":true,"padRight":true},{"text":"denotes the natural parameters parameterized by the latent variables, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"the sufficient statistics, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"the base measure, and ","element":"span"},{"style":{"height":16},"width":86.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-14.png","element":"img","alt":" A(η)","inline":true,"padRight":true},{"text":"the log-partition function. 
Note that both ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-15.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"are vectors of size ","element":"span"},{"style":{"height":13.19},"width":34.52,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-16.png","element":"img","alt":" Id","inline":true},{"text":". Working with the exponential family lets us draw a useful relation between scaled and original data: ","element":"span"},{"id":"id-25","style":{"fontWeight":"bold"},"text":"Proposition 3.1 ","element":"span"},{"text":"(Simplified)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":117.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-17.png","element":"img","alt":" p(x; η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a member of the exponential family where ","element":"span"},{"style":{"height":11.6},"width":100.48,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-18.png","element":"img","alt":" x ∈ R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.98},"width":117.87,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-19.png","element":"img","alt":" η ∈ RI","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Besides, let us define ","element":"span"},{"style":{"height":6.8},"width":136.24,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-20.png","element":"img","alt":" �x := ωx","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for a given ","element":"span"},{"style":{"height":11.6},"width":103.94,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-21.png","element":"img","alt":" ω ∈ R","inline":true},{"style":{"fontStyle":"italic"},"text":". Then, if every sufficient statistic can be factorized as ","element":"span"},{"style":{"height":16},"width":446.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-22.png","element":"img","alt":" Ti(�x) = fi(ω)Ti(x)+gi(ω)","inline":true},{"style":{"fontStyle":"italic"},"text":", the following holds:","element":"span"}],[{"style":{"height":22.38},"width":648.63,"height":55.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-23.png","element":"img","alt":"∂j�ηi log p(�x, �η) = fi(ω)j ∂jηi log p(x; η),","inline":true,"padRight":true},{"text":"(9) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":19.32},"width":48.58,"height":48.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-24.png","element":"img","alt":" ∂jηi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":22.38},"width":48.58,"height":55.94,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-25.png","element":"img","alt":" ∂j�ηi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denote the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"style":{"fontStyle":"italic"},"text":"-th partial derivative with respect to 
","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-26.png","element":"img","alt":" ηi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":240.75,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/3-27.png","element":"img","alt":" �ηi := ηi/fi(ω)","inline":true},{"style":{"fontStyle":"italic"},"text":", respectively.","element":"span"}],[{"id":"id-26","text":"Table 1: First two columns: Multiplicative and additive noise (see Prop. ","element":"figcaption","subtype":"caption"},{"href":"#id-25","text":"3.1) ","element":"a","subtype":"caption"},{"text":"for some common distributions (parameterized for simplicity with the canonical parameters, instead of the natural ones). When ","element":"figcaption","subtype":"caption"},{"style":{"height":14},"width":30.51,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-0.png","element":"img","alt":" fi","inline":true,"padRight":true},{"text":"or ","element":"figcaption","subtype":"caption"},{"style":{"height":10},"width":30.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-1.png","element":"img","alt":" gi","inline":true,"padRight":true},{"text":"is omitted, it is assumed to be ","element":"figcaption","subtype":"caption"},{"text":"1 ","element":"figcaption","subtype":"caption"},{"text":"or ","element":"figcaption","subtype":"caption"},{"text":"0","element":"figcaption","subtype":"caption"},{"text":", respectively. 
Last two columns: ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"L","element":"figcaption","subtype":"caption"},{"text":"-smoothness of the scaled likelihood (parameterized by ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":35.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-2.png","element":"img","alt":" ˜η1","inline":true,"padRight":true},{"text":"and ","element":"figcaption","subtype":"caption"},{"style":{"height":10.4},"width":35.78,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-3.png","element":"img","alt":" ˜η2","inline":true},{"text":") as a function of the original (canonical) likelihood parameters. ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Rat ","element":"figcaption","subtype":"caption"},{"text":"denotes a rational function, and ","element":"figcaption","subtype":"caption"},{"style":{"height":17.38},"width":67.73,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-4.png","element":"img","alt":" ψ(1)","inline":true,"padRight":true},{"text":"the trigamma function.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"72%"},"width":1364,"height":375,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-5.png","element":"img"}],[{"text":"A more complex version of the proposition and its proof can be found in Appendix ","element":"span"},{"text":"C. ","element":"span"},{"text":"Although the proposition’s requirements may look restrictive at first, as reported in Table ","element":"span"},{"href":"#id-26","text":"1, ","element":"a"},{"text":"many commonly-used distributions fulfil such properties. 
It is also worth mentioning that, in the case of the log-normal distribution, we consider the scaling function ","element":"span"},{"style":{"height":10.58},"width":126.28,"height":26.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-6.png","element":"img","alt":" ˜x = x^ω","inline":true},{"text":", instead of ","element":"span"},{"style":{"height":6.8},"width":125.17,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-7.png","element":"img","alt":" ˜x = ωx","inline":true},{"text":".","element":"span"}],[{"text":"Assume now that ","element":"span"},{"style":{"height":16},"width":75.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-8.png","element":"img","alt":" ℓ(η)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-9.png","element":"img","alt":" Li","inline":true},{"text":"-smooth with respect to its ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"text":"-th natural parameter, ","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-10.png","element":"img","alt":" ηi","inline":true},{"text":". 
Using Proposition ","element":"span"},{"href":"#id-25","text":"3.1, ","element":"a"},{"text":"we obtain the Lipschitz constant of the scaled likelihood ","element":"span"},{"style":{"height":16},"width":110.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-11.png","element":"img","alt":"�ℓd(�ηd)","inline":true,"padRight":true},{"text":"as a function of the original one ","element":"span"},{"style":{"height":16},"width":75.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-12.png","element":"img","alt":" ℓ(η)","inline":true},{"text":", i.e.,","element":"span"}],[{"id":"id-43","style":{"width":"85%"},"width":1599,"height":118,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16.58},"width":173.62,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-14.png","element":"img","alt":" �a, �b ∈ RI","inline":true,"padRight":true},{"text":"are two different (scaled) parameters and the last expression is a result of the Cauchy-Schwarz inequality. 
Assuming the ","element":"span"},{"text":"1","element":"span"},{"text":"-norm, this implies that the scaled log-likelihood ","element":"span"},{"style":{"height":16},"width":71.31,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-15.png","element":"img","alt":"�ℓ(�η)","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-16.png","element":"img","alt":"�Li","inline":true},{"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.4},"width":30.78,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-17.png","element":"img","alt":" �ηi","inline":true},{"text":", with","element":"span"}],[{"id":"id-28","style":{"width":"63%"},"width":1185,"height":92,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-18.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"3.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"“Standardizing” the optimization landscape","element":"span"}],[{"text":"In order to quantify the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness of a function, we need to compute its Lipschitz constant. As we are considering here data scaling transformations, i.e., a preprocessing step, we focus on the local ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness around the empirical estimation of the natural parameters, denoted by ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-19.png","element":"img","alt":" �η","inline":true},{"text":". 
As an example, assuming a Gaussian variable with empirical mean and standard deviation denoted by ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-20.png","element":"img","alt":" �µ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-21.png","element":"img","alt":" �σ","inline":true},{"text":", then ","element":"span"},{"style":{"height":17.39},"width":174.94,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-22.png","element":"img","alt":" �η1 = �µ/�σ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":221.77,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-23.png","element":"img","alt":" �η2 = −1/2�σ2","inline":true},{"text":".","element":"span"}],[{"text":"Unfortunately, calculating the (","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-24.png","element":"img","alt":"ε","inline":true},{"text":"-local) Lipschitz constant may be challenging, as it involves solving","element":"span"}],[{"style":{"width":"66%"},"width":1251,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":125.4,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-26.png","element":"img","alt":" B(�η, ε)","inline":true,"padRight":true},{"text":"is the ball with radius ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-27.png","element":"img","alt":" ε","inline":true,"padRight":true},{"text":"and centered in the empirical estimation of the 
natural parameters ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-28.png","element":"img","alt":" �η","inline":true},{"text":". Instead, we here rely on an estimator of ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-29.png","element":"img","alt":" Li","inline":true},{"text":", which is derived by taking the limit ","element":"span"},{"style":{"height":11.2},"width":110.65,"height":28,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-30.png","element":"img","alt":" ε → 0","inline":true,"padRight":true},{"text":"and making use of the multivariate mean value theorem.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Theorem 3.1 ","element":"span"},{"text":"(Mean Value Theorem)","element":"span"},{"style":{"fontWeight":"bold"},"text":". ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":73.49,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-31.png","element":"img","alt":" ℓ(η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a twice-differentiable real-valued function with respect to ","element":"span"},{"style":{"height":12.4},"width":110.2,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-32.png","element":"img","alt":" ηi ∈ η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"on ","element":"span"},{"style":{"height":16.58},"width":128.42,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-33.png","element":"img","alt":" Q ⊂ RI","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Then, for any two values ","element":"span"},{"style":{"height":14},"width":144.4,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-34.png","element":"img","alt":" a, b ∈ Q","inline":true},{"style":{"fontStyle":"italic"},"text":", there exists ","element":"span"},{"style":{"height":14},"width":101.16,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-35.png","element":"img","alt":" c ∈ Q","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that","element":"span"}],[{"style":{"width":"38%"},"width":724,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-36.png","element":"img"}],[{"text":"By taking norms above and applying the Cauchy-Schwarz inequality we obtain the same inequality as in Eq. ","element":"span"},{"href":"#id-27","text":"5, ","element":"a"},{"style":{"height":20.69},"width":821.44,"height":51.72,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-37.png","element":"img","alt":"��∂ηiℓ(a) − ∂ηiℓ(b)�� ≤��∇η∂ηiℓ(c)�� · ||a − b||","inline":true},{"text":". 
Setting ","element":"span"},{"style":{"height":10.8},"width":108.92,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-38.png","element":"img","alt":" c = �η","inline":true},{"text":", we obtain our local estimator of the Lipschitz constant as:","element":"span"}],[{"style":{"width":"67%"},"width":1268,"height":87,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/4-39.png","element":"img"}],[{"text":"Importantly, if ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-0.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-1.png","element":"img","alt":" Li","inline":true},{"text":"-smooth for each ","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-2.png","element":"img","alt":" ηi","inline":true,"padRight":true},{"text":"in the set of natural parameters ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-3.png","element":"img","alt":" η","inline":true},{"text":", then it is ","element":"span"},{"style":{"height":16.74},"width":100.09,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-4.png","element":"img","alt":"�i Li","inline":true},{"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-5.png","element":"img","alt":" η","inline":true},{"text":". 
","element":"span"},{"text":"Similarly, if ","element":"span"},{"style":{"height":7.6},"width":32.61,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-6.png","element":"img","alt":" ℓ1","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-7.png","element":"img","alt":" L1","inline":true},{"text":"-smooth and ","element":"span"},{"style":{"height":13.19},"width":89.1,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-8.png","element":"img","alt":" ℓ2 L2","inline":true},{"text":"-smooth, then ","element":"span"},{"style":{"height":11.59},"width":118.08,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-9.png","element":"img","alt":" ℓ1 + ℓ2","inline":true,"padRight":true},{"text":"is ","element":"span"},{"style":{"height":16},"width":172.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-10.png","element":"img","alt":" (L1 + L2)","inline":true},{"text":"-smooth.","element":"span"},{"text":"4 ","element":"span"},{"text":"These properties are proved in Appendix ","element":"span"},{"text":"B.","element":"span"}],[{"text":"Moreover, for the distributions considered in Table ","element":"span"},{"href":"#id-26","text":"1, ","element":"a"},{"text":"we can use our estimator to approximate the resulting ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness after standardizing the data (details in Appendix ","element":"span"},{"text":"E)","element":"span"},{"text":". 
These results shed some light on why standardizing works well in many settings, since ","element":"span"},{"style":{"fontStyle":"italic"},"text":"it makes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontStyle":"italic"},"text":"-smoothness comparable across dimensions ","element":"span"},{"text":"for several common likelihood functions. Specifically, i) the exponential and Rayleigh distributions have constant (local) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness; ii) a centered (log-)normal distribution is ","element":"span"},{"text":"3","element":"span"},{"text":"-smooth; and iii) the Gamma distribution is (approximately) ","element":"span"},{"text":"1","element":"span"},{"text":"-smooth as long as its shape parameter ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-11.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"(which is scale-invariant, i.e., ","element":"span"},{"style":{"height":10.8},"width":105.32,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-12.png","element":"img","alt":" ˜α = α","inline":true},{"text":") is sufficiently large. However, Table ","element":"span"},{"href":"#id-26","text":"1 ","element":"a"},{"text":"also shows that for other likelihoods the resulting Lipschitz constants may not be comparable. 
This is the case for the inverse Gaussian (Gamma) distribution, whose Lipschitz constants after standardizing are rational functions of ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-13.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"(of ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-14.png","element":"img","alt":" α","inline":true},{"text":") that can be arbitrarily large or small.","element":"span"}]]},{"heading":"4 Lipschitz standardization","paragraphs":[[{"text":"In the previous section we observed that the Lipschitz constant after scaling the data, ","element":"span"},{"style":{"height":16},"width":98.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-15.png","element":"img","alt":"�Li(ω)","inline":true},{"text":", can be seen as a function of the scaling factor ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-16.png","element":"img","alt":" ω","inline":true},{"text":". As a consequence, it should be possible to find an ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-17.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"that eases balanced multivariate learning by making all the dimensions in the data share the same Lipschitz constant. In this section, we propose a novel data scaling algorithm with this same goal in mind, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization","element":"span"},{"text":". 
Intuitively, our algorithm puts the data into a region of the parameter space where the local ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness is comparable across all dimensions.","element":"span"}],[{"text":"Given a single ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smooth function ","element":"span"},{"style":{"height":16},"width":76.15,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-18.png","element":"img","alt":" ℓ(γ)","inline":true},{"text":", it can be shown that there exists an optimal step size ","element":"span"},{"style":{"height":16},"width":163.95,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-19.png","element":"img","alt":" α∗ = 1/L","inline":true,"padRight":true},{"text":"for first-order optimization ","element":"span"},{"href":"#id-16","referenceIndex":17,"text":"(Nesterov, ","element":"a"},{"href":"#id-16","referenceIndex":17,"text":"2018)","element":"a"},{"text":". 
However, when we aim to jointly fit multiple functions, in our case log-likelihood functions ","element":"span"},{"style":{"height":17.9},"width":264.11,"height":44.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-20.png","element":"img","alt":" {ℓd(ηd(γ))}Dd=1","inline":true},{"text":", each one being ","element":"span"},{"style":{"height":13.19},"width":44.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-21.png","element":"img","alt":" Ld","inline":true},{"text":"-smooth, the optimal learning rate for each individual likelihood is ","element":"span"},{"text":"different, although the parameters (in our case, the variational parameters ","element":"span"},{"style":{"height":10.8},"width":25,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-22.png","element":"img","alt":" γ","inline":true},{"text":") that we optimize are shared. Importantly, while there exists an optimal learning rate for the overall likelihood function ","element":"span"},{"style":{"height":16.78},"width":366.68,"height":41.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-23.png","element":"img","alt":" ℓ(γ) = ∑d ℓd(ηd(γ))","inline":true},{"text":", it may still lead ","element":"span"},{"text":"to an unbalanced learning process, and thus, to inaccurate fitting of the data.","element":"span"}],[{"text":"The proposed ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Lipschitz standardization ","element":"span"},{"text":"scales each ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-th dimension using the weight ","element":"span"},{"style":{"height":15.5},"width":42.23,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-24.png","element":"img","alt":" ω∗d","inline":true},{"text":", obtained such that all dimensions share a similar Lipschitz constant, 
i.e.,","element":"span"}],[{"id":"id-29","style":{"width":"65%"},"width":1232,"height":132,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-25.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":16},"width":131.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-26.png","element":"img","alt":"�Ldi(ωd)","inline":true,"padRight":true},{"text":"are the scaled Lipschitz constants, as in Eq. ","element":"span"},{"href":"#id-28","text":"11, ","element":"a"},{"text":"and ","element":"span"},{"style":{"height":10.98},"width":43.12,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-27.png","element":"img","alt":" L∗","inline":true,"padRight":true},{"text":"the target ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness. In our experiments we set ","element":"span"},{"style":{"height":10.98},"width":43.12,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-28.png","element":"img","alt":" L∗","inline":true,"padRight":true},{"text":"to ","element":"span"},{"style":{"height":16},"width":131.09,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-29.png","element":"img","alt":" 1/(Dα)","inline":true},{"text":", where ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-30.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is the initial learning rate set by the practitioner. 
The motivation behind this choice is to bring the resulting overall likelihood ","element":"span"},{"style":{"height":14.83},"width":27,"height":37.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-31.png","element":"img","alt":"˜L","inline":true},{"text":"-smoothness close to the one that is optimal for a given learning rate, namely ","element":"span"},{"style":{"height":19.61},"width":563.24,"height":49.03,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-32.png","element":"img","alt":"˜L = ∑d ˜Ld ≈ ∑d 1/(Dα) = 1/α","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 1. ","element":"span"},{"text":"In our experiments, we use Proposition ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"and automatic differentiation to approximate the local ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness, as well as closed-form solutions and root-finding methods to find the optimal scaling factors ","element":"span"},{"style":{"height":15.5},"width":42.24,"height":38.75,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-33.png","element":"img","alt":" ω∗d","inline":true,"padRight":true},{"text":"(details ","element":"span"},{"text":"in Appendix ","element":"span"},{"text":"D)","element":"span"},{"text":". However, we note that gradient descent may also be used to solve the optimization problem in Eq. ","element":"span"},{"href":"#id-29","text":"14. ","element":"a"},{"text":"As a result, Lipschitz standardization is applicable to log-likelihood functions other than the ones discussed above, as well as to problems beyond BBVI.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark 2. 
","element":"span"},{"text":"Our algorithm is a preprocessing step, and thus the Lipschitz standardized data ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-34.png","element":"img","alt":" ˜x","inline":true},{"text":", as well as the scaled likelihood functions ","element":"span"},{"style":{"height":16},"width":110.65,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-35.png","element":"img","alt":"˜ℓd(˜ηd)","inline":true},{"text":", are used to learn the model parameters (the variational parameters, in our case). However, during test and deployment, we ought to return to the original space of the data. This can be done, in the case of distributions in the exponential family (see Section ","element":"span"},{"href":"#id-30","text":"3.1) ","element":"a"},{"text":"by using Prop. ","element":"span"},{"href":"#id-25","text":"3.1, ","element":"a"},{"text":"which shows how to obtain the parameters of the original likelihood function as ","element":"span"},{"style":{"height":16},"width":244.16,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/5-36.png","element":"img","alt":" η = f(ω) ⊙ ˜η","inline":true},{"text":". Appendix ","element":"span"},{"text":"A ","element":"span"},{"text":"briefly sketches this idea, providing examples on how our approach applies to the distributions in Table ","element":"span"},{"href":"#id-26","text":"1 ","element":"a"},{"text":"and to discrete data, which we discuss next.","element":"span"}],[{"id":"id-40","style":{"fontWeight":"bold"},"text":"4.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Discrete data","element":"span"}],[{"text":"Up to this point, our algorithm only applies to continuous data and likelihood functions. However, real-world data often present mixed continuous and discrete data types, as well as likelihood models. 
Next, we extend the proposed","element":"span"}],[{"id":"id-34","style":{"width":"100%"},"width":1873,"height":569,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-0.png","element":"img"}],[{"text":"Figure 3: Missing imputation error across different datasets and models (lower is better). Each method appears only when applicable and is shown in the same order as in the legend.","element":"figcaption","subtype":"caption"}],[{"text":"Lipschitz-standardization method to discrete data (represented using the natural numbers), assuming discrete distributions such as Bernoulli, Poisson and categorical distributions. We refer to this new approach as the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Gamma trick","element":"span"},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Gamma Trick. ","element":"span"},{"text":"This approach (detailed in Appendix ","element":"span"},{"text":"A) ","element":"span"},{"text":"can be summarised in four steps: i) transform the discrete data ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"to continuous ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"via additive noise, i.e., ","element":"span"},{"style":{"height":10.4},"width":156.08,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-1.png","element":"img","alt":" x = x+ε","inline":true},{"text":", for which we assume a Gamma likelihood; ii) apply Lipschitz standardization to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"to promote more balanced learning; iii) apply the learning process on the scaled data ","element":"span"},{"style":{"height":6.8},"width":23,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-2.png","element":"img","alt":" ˜x","inline":true,"padRight":true},{"text":"to learn the model parameters 
","element":"span"},{"style":{"height":10.4},"width":21.76,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-3.png","element":"img","alt":" ˜η","inline":true},{"text":"; and iv) estimate the parameters of the original discrete distribution using the learned (un-)scaled continuous distribution.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Recovering the parameters of the discrete likelihood. ","element":"span"},{"text":"The Bernoulli and Poisson distributions are characterized by their expected value. Hence, to recover their distributional parameters for testing, it suffices to match the means of the original distribution and its (un-scaled) Gamma counterpart. Note that the mean of the discrete variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is given by ","element":"span"},{"style":{"height":16},"width":245.34,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-4.png","element":"img","alt":" µ = µ − E [ε]","inline":true},{"text":", where ","element":"span"},{"style":{"height":10},"width":24,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-5.png","element":"img","alt":" µ","inline":true,"padRight":true},{"text":"is the mean of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":", i.e., ","element":"span"},{"style":{"height":16},"width":68.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-6.png","element":"img","alt":" α/β","inline":true,"padRight":true},{"text":"under the (un-scaled) Gamma distribution with parameters ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-7.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"and 
","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-8.png","element":"img","alt":" β","inline":true},{"text":". Therefore, we estimate the mean of the Bernoulli distribution as ","element":"span"},{"style":{"height":16},"width":375.55,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-9.png","element":"img","alt":" p = max(0, min(1, µ))","inline":true},{"text":", and the rate of the Poisson distribution as ","element":"span"},{"style":{"height":16},"width":240.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-10.png","element":"img","alt":" λ = max(δ, µ)","inline":true},{"text":", where ","element":"span"},{"style":{"height":12.8},"width":174.27,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-11.png","element":"img","alt":" 0 < δ ≪ 1","inline":true,"padRight":true},{"text":"to ensure that ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-12.png","element":"img","alt":" λ","inline":true,"padRight":true},{"text":"is positive.","element":"span"}],[{"text":"As the categorical distribution has more than one parameter, a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Bernoulli trick ","element":"span"},{"text":"is applied before applying the Gamma trick. The Bernoulli trick assumes a one-hot representation of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K","element":"span"},{"text":"-dimensional categorical distribution and treats each class as an independent Bernoulli distribution, which, as shown above, is suitable for the Gamma trick. 
To recover the parameters of the categorical distribution ","element":"span"},{"style":{"height":16},"width":354.19,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-13.png","element":"img","alt":" π = (π1, π2, . . . , πK)","inline":true,"padRight":true},{"text":"we individually recover the mean of each Bernoulli class, ","element":"span"},{"style":{"height":10},"width":41.01,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-14.png","element":"img","alt":" µk","inline":true},{"text":", and normalize them so that they sum to one, i.e., ","element":"span"},{"style":{"height":20.4},"width":296.34,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-15.png","element":"img","alt":" πk = µk/∑Ki=1 µi","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"k ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , K","element":"span"},{"text":". Note that, when applying ","element":"span"},{"text":"Lipschitz standardization to the categorical distribution, we account for the fact that it has been divided into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"Gamma distributions. 
As we want all the observed dimensions to be ","element":"span"},{"style":{"height":10.98},"width":43.12,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-16.png","element":"img","alt":" L∗","inline":true},{"text":"-smooth, we group the new ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"Gamma distributions and set their objective ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness to ","element":"span"},{"style":{"height":16},"width":101.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-17.png","element":"img","alt":" L∗/K","inline":true},{"text":", so that they add up to the same ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness, i.e., ","element":"span"},{"style":{"height":16.74},"width":59.06,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-18.png","element":"img","alt":"∑k","inline":true,"padRight":true},{"text":"˜L","element":"span"},{"style":{"height":13.38},"width":115.86,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-19.png","element":"img","alt":"k = L∗","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Additive noise. 
","element":"span"},{"text":"In our transformation from discrete data into continuous data, ","element":"span"},{"style":{"height":10.4},"width":152.06,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-20.png","element":"img","alt":" x = x+ε","inline":true},{"text":", we ensure that the continuous noise variable ","element":"span"},{"style":{"height":7.2},"width":19,"height":18,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-21.png","element":"img","alt":" ε","inline":true},{"text":": i) lies in a non-zero measure subset of the unit interval ","element":"span"},{"style":{"height":16},"width":175.32,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-22.png","element":"img","alt":" ε ∈ (0, 1)","inline":true,"padRight":true},{"text":"so that the original value is identifiable; ii) preserves the original data shape as much as possible; and iii) ensures that the shape parameter ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-23.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"of the Gamma is far from zero, and ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-24.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"does not become arbitrarily large (see Appendix ","element":"span"},{"text":"E ","element":"span"},{"text":"for further details). 
In our experiments we use noise ","element":"span"},{"style":{"height":16},"width":297.94,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/6-25.png","element":"img","alt":" ε ∼ Beta(1.1, 30)","inline":true},{"text":".","element":"span"}]]},{"heading":"5 Experiments","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Experimental setup ","element":"span"},{"text":"We use six different datasets from the UCI repository ","element":"span"},{"href":"#id-2","referenceIndex":5,"text":"(Dua and Graff, ","element":"a"},{"href":"#id-2","referenceIndex":5,"text":"2017) ","element":"a"},{"text":"and apply BBVI to fit three generative models: i) mixture model; ii) matrix factorization; and iii) (vanilla) VAE. Additionally, we pick a likelihood for each dimension based on its observable properties (e.g., positive real data or categorical data) and, to provide a fair initialization across all methods and datasets, continuous data is standardized beforehand. Appendix ","element":"span"},{"text":"F ","element":"span"},{"text":"contains further details and tabular results.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Methods. ","element":"span"},{"text":"We consider different combinations of continuous and discrete preprocessing, using them as prefix and suffix, respectively, in our naming scheme. Specifically, for continuous variables we use: i) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"std","element":"span"},{"text":", standardization; ii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"max","element":"span"},{"text":", normalization; iii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"iqr","element":"span"},{"text":", which divides by the interquartile range; and iv) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"lip","element":"span"},{"text":", Lipschitz standardization. 
Similarly, for discrete distributions we consider: i) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"none","element":"span"},{"text":", which leaves the discrete data as it is; ii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"bern","element":"span"},{"text":", which applies the Bernoulli trick to","element":"span"}],[{"id":"id-33","style":{"width":"85%"},"width":1592,"height":616,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/7-0.png","element":"img"}],[{"text":"Figure 5: Per-dimension normalized error on the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Adult ","element":"figcaption","subtype":"caption"},{"text":"dataset. Top row: Matrix factorization. Bottom row: VAE. Note that all methods but ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma ","element":"figcaption","subtype":"caption"},{"text":"overlook a subset of the variables.","element":"figcaption","subtype":"caption"}],[{"text":"categorical data; and iii) ","element":"span"},{"style":{"fontStyle":"italic"},"text":"gamma","element":"span"},{"text":", which applies the Gamma trick to all discrete variables. As an example, the proposed method applies the Gamma trick to the discrete variables and then Lipschitz-standardizes all the data, so it is denoted as ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma","element":"span"},{"text":".","element":"span"}],[{"id":"id-32","style":{"width":"52%"},"width":977,"height":988,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/7-1.png","element":"img"}],[{"text":"Figure 4: Per-dimension normalized error for different models on the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"Letter ","element":"figcaption","subtype":"caption"},{"text":"dataset. The dotted line represents the baseline. 
Values closer to the origin are better.","element":"figcaption","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Metric. ","element":"span"},{"text":"Analogously to ","element":"span"},{"href":"#id-31","referenceIndex":16,"text":"Nazabal et al. ","element":"a"},{"href":"#id-31","referenceIndex":16,"text":"(2018)","element":"a"},{"text":", we evaluate the performance of the methods in a data imputation task using the average missing imputation error as the evaluation metric. Specifically, normalized mean squared error is used for numerical variables and error rate for nominal ones. In addition, in Figures ","element":"span"},{"href":"#id-32","text":"4 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-33","text":"5, ","element":"a"},{"text":"we show the imputation error, normalized by the error obtained by mean imputation, for each dimension.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Results. ","element":"span"},{"text":"Figure ","element":"span"},{"href":"#id-34","text":"3 ","element":"a"},{"text":"summarizes the results averaged over three settings with ","element":"span"},{"text":"10","element":"span"},{"text":", ","element":"span"},{"text":"20","element":"span"},{"text":", and ","element":"span"},{"text":"50 % ","element":"span"},{"text":"of missing values (with 10 independent runs each), where outliers were removed for better visualization (more detailed results can be found in Appendix ","element":"span"},{"text":"G)","element":"span"},{"text":". We can distinguish two groups. 
","element":"span"},{"text":"The first group corresponds to the methods that leave discrete data untouched, where we observe that Lipschitz standardization (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-none","element":"span"},{"text":") provides results comparable to the best of its counterparts (","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"max-none","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-none","element":"span"},{"text":", ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"iqr-none","element":"span"},{"text":"). Particularly noteworthy are the results of matrix factorization in ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit","element":"span"},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-none ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"iqr-none ","element":"span"},{"text":"completely disappear from the plot after removing outliers. Clearly, the methods in the second group, which handle discrete variables using either the Bernoulli or Gamma trick, outperform those in the first group. This becomes particularly clear on highly heterogeneous datasets (e.g., ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult","element":"span"},{"text":"), where we match, and occasionally beat, the state-of-the-art results reported by ","element":"span"},{"href":"#id-31","referenceIndex":16,"text":"Nazabal et al. 
","element":"a"},{"href":"#id-31","referenceIndex":16,"text":"(2018)","element":"a"},{"text":".","element":"span"}],[{"text":"We remark that, while results across models are consistent, the effect of data preprocessing directly depends on the model capacity and dataset complexity. Specifically, the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"mixture model ","element":"span"},{"text":"is too restrictive, finding the same optimum regardless of the preprocessing; ","element":"span"},{"style":{"fontStyle":"italic"},"text":"matrix factorization ","element":"span"},{"text":"has enough capacity to be greatly affected by the data (as shown in Figure ","element":"span"},{"href":"#id-34","text":"3)","element":"a"},{"text":"; and the VAE is powerful enough to overcome most of the differences in the preprocessing for simpler datasets, yet it is still affected on the most complex datasets. This is nicely exemplified in Figure ","element":"span"},{"href":"#id-32","text":"4, ","element":"a"},{"text":"which shows per-dimension normalized error on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Letter ","element":"span"},{"text":"dataset, where we clearly observe the benefits of both the Bernoulli and Gamma tricks.","element":"span"}],[{"text":"Last but not least, the advantage of using Lipschitz standardization, i.e., ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma","element":"span"},{"text":", compared with the other two competitive methods, ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-bern ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-gamma","element":"span"},{"text":", comes in the form of more consistent results for all datasets and independent runs, due to a more balanced learning. 
This can be easily seen by analyzing the per-dimension error of the most complex datasets (see Figure ","element":"span"},{"href":"#id-33","text":"5","element":"a"},{"text":"), where ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma ","element":"span"},{"text":"improves the overall imputation error across tasks without completely overlooking any variable. On the other hand, both ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-bern ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-gamma ","element":"span"},{"text":"overlook four different variables on the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult ","element":"span"},{"text":"dataset using two different models. This behavior is not exclusive to ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult","element":"span"},{"text":", as Figures ","element":"span"},{"href":"#id-35","text":"10-","element":"a"},{"href":"#id-36","text":"11 ","element":"a"},{"text":"and Tables ","element":"span"},{"href":"#id-37","text":"4-","element":"a"},{"href":"#id-38","text":"6 ","element":"a"},{"text":"in Appendix ","element":"span"},{"text":"F ","element":"span"},{"text":"show. To tie everything up, we would like to point out that the illustrative example given in Section ","element":"span"},{"text":"1 ","element":"span"},{"text":"(Figures ","element":"span"},{"href":"#id-3","text":"1-","element":"a"},{"href":"#id-7","text":"2) ","element":"a"},{"text":"corresponds to a particular run from the bottom row.","element":"span"}]]},{"heading":"6 Conclusions","paragraphs":[[{"text":"In this work we have introduced the problem of balanced multivariate learning, which occurs when first-order optimization is used to perform approximate inference in multivariate probabilistic models, and which can be seen as an MTL problem. 
Then, since existing solutions for MTL problems do not seem to directly apply in the probabilistic setting, we have instead focused on data preprocessing as a simple and practical solution to mitigate unbalanced learning. In particular, we have provided new insights into the behaviour of data standardization, finding that it makes the smoothness of common continuous log-likelihoods comparable. Finally, we have proposed Lipschitz standardization, a data preprocessing algorithm that eases balanced multivariate learning by making the local ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness equal across all (discrete and continuous) dimensions of the data. Our experiments show that Lipschitz standardization outperforms existing methods, and is especially effective when the data is highly heterogeneous.","element":"span"}],[{"text":"Interesting research avenues include the implementation of Lipschitz standardization in probabilistic programming pipelines, its use in settings different from BBVI (e.g., HMC), and extending this idea to an online algorithm embedded in the learning process, which takes the model into consideration and enables fine-tuning of the local Lipschitz constants during learning.","element":"span"}]]},{"heading":"References","paragraphs":[[{"id":"id-21","text":"Aksoy, S. and Haralick, R. M. (2001). Feature normalization and likelihood-based similarity measures for image ","element":"span"},{"text":"retrieval. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Pattern recognition letters","element":"span"},{"text":", 22(5):563–582.","element":"span"}],[{"text":"Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of the American statistical Association","element":"span"},{"text":", 112(518):859–877.","element":"span"}],[{"id":"id-5","text":"Chen, Z., Badrinarayanan, V., Lee, C.-Y., and Rabinovich, A. (2018). Gradnorm: Gradient normalization for adaptive ","element":"span"},{"text":"loss balancing in deep multitask networks. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"International Conference on Machine Learning","element":"span"},{"text":", pages 794–803. PMLR.","element":"span"}],[{"id":"id-1","text":"Diederik, P. K., Welling, M., et al. (2014). Auto-encoding variational bayes. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the International Conference on Learning Representations (ICLR)","element":"span"},{"text":", volume 1.","element":"span"}],[{"id":"id-2","text":"Dua, D. and Graff, C. (2017). UCI machine learning repository.","element":"span"}],[{"id":"id-18","text":"Gnanadesikan, R., Kettenring, J. R., and Tsao, S. L. (1995). Weighting and selection of variables for cluster analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of Classification","element":"span"},{"text":", 12(1):113–136.","element":"span"}],[{"id":"id-12","text":"Guo, M., Haque, A., Huang, D.-A., Yeung, S., and Fei-Fei, L. (2018). Dynamic task prioritization for multitask ","element":"span"},{"text":"learning. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the European Conference on Computer Vision (ECCV)","element":"span"},{"text":", pages 270–287.","element":"span"}],[{"id":"id-22","text":"Han, J., Pei, J., and Kamber, M. (2011). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Data mining: concepts and techniques","element":"span"},{"text":". Elsevier.","element":"span"}],[{"id":"id-8","text":"Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"The Journal of Machine Learning Research","element":"span"},{"text":", 14(1):1303–1347.","element":"span"}],[{"id":"id-24","text":"Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal ","element":"span"},{"text":"covariate shift. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1502.03167","element":"span"},{"text":".","element":"span"}],[{"id":"id-48","text":"Jang, E., Gu, S., and Poole, B. (2016). ","element":"span"},{"text":"Categorical reparameterization with gumbel-softmax. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1611.01144","element":"span"},{"text":".","element":"span"}],[{"id":"id-19","text":"Juszczak, P., Tax, D., and Duin, R. P. (2002). Feature scaling in support vector data description. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proc. asci","element":"span"},{"text":", pages 95–102. Citeseer.","element":"span"}],[{"id":"id-11","text":"Kendall, A., Gal, Y., and Cipolla, R. (2018). Multi-task learning using uncertainty to weigh losses for scene geometry ","element":"span"},{"text":"and semantics. In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Proceedings of the IEEE conference on computer vision and pattern recognition","element":"span"},{"text":", pages 7482– 7491.","element":"span"}],[{"id":"id-20","text":"Milligan, G. W. and Cooper, M. C. (1988). A study of standardization of variables in cluster analysis. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Journal of classification","element":"span"},{"text":", 5(2):181–204.","element":"span"}],[{"id":"id-14","text":"Milojkovic, N., Antognini, D., Bergamin, G., Faltings, B., and Musat, C. (2019). Multi-gradient descent for multi- ","element":"span"},{"text":"objective recommender systems. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:2001.00846","element":"span"},{"text":".","element":"span"}],[{"id":"id-31","text":"Nazabal, A., Olmos, P. M., Ghahramani, Z., and Valera, I. (2018). Handling incomplete heterogeneous data using ","element":"span"},{"text":"vaes. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1807.03653","element":"span"},{"text":".","element":"span"}],[{"id":"id-16","text":"Nesterov, Y. (2018). ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Lectures on convex optimization","element":"span"},{"text":", volume 137. Springer.","element":"span"}],[{"id":"id-0","text":"Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Artificial Intelligence and Statistics","element":"span"},{"text":", pages 814–822.","element":"span"}],[{"id":"id-4","text":"Ruder, S. (2017). An overview of multi-task learning in deep neural networks. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"arXiv preprint arXiv:1706.05098","element":"span"},{"text":".","element":"span"}],[{"id":"id-23","text":"Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? ","element":"span"},{"text":"In ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Advances in Neural Information Processing Systems","element":"span"},{"text":", pages 2483–2493.","element":"span"}]]},{"heading":"A Data workﬂow and Gamma trick","paragraphs":[[{"text":"It is important to bear in mind the transformation the data follows during the training procedure, as well as what we do with the data at each phase. To clarify this in our setting, we provide in Figure ","element":"span"},{"href":"#id-39","text":"6 ","element":"a"},{"text":"two diagrams describing this procedure for continuous and discrete variables, following the notation of the main paper. 
In summary, the data are transformed and scaled, and the scaled natural parameters are learned during training. Whenever evaluation is needed, these parameters are always returned to the space of the original data, that is, ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-0.png","element":"img","alt":" η̃","inline":true,"padRight":true},{"text":"is transformed to ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-1.png","element":"img","alt":" η","inline":true,"padRight":true},{"text":"before evaluating ","element":"span"},{"id":"id-39","text":"on the space of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":".","element":"span"}],[{"style":{"width":"65%"},"width":1224,"height":266,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-2.png","element":"img"}],[{"text":"Figure 6: Schematic workflow used in this work. For training, the data are transformed and their natural parameters are learned. To evaluate, the original parameters are recovered from the transformed ones.","element":"figcaption","subtype":"caption"}],[{"text":"To avoid confusion, let us clarify the transformations described in Figure ","element":"span"},{"href":"#id-39","text":"6b ","element":"a"},{"text":"(the continuous case is included as a special case). The step ","element":"span"},{"style":{"height":10.79},"width":157.24,"height":26.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-3.png","element":"img","alt":" xd ↦ xd","inline":true,"padRight":true},{"text":"refers to all the transformations regarding discrete data explained in Section ","element":"span"},{"href":"#id-40","text":"4.1 ","element":"a"},{"text":"of the main paper. 
Specifically, this means splitting a categorical variable into ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"independent Bernoulli ones in the case of the Bernoulli trick, and adding noise in the case of the Gamma trick. The transformation ","element":"span"},{"style":{"height":10.79},"width":158.18,"height":26.97,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-4.png","element":"img","alt":" xd ↦ x̃d","inline":true,"padRight":true},{"text":"refers to the data scaling procedure: standardization, normalization, Lipschitz standardization, etc. The orange arrow is the process performed by the model, which takes the input ","element":"span"},{"style":{"height":9.59},"width":43.26,"height":23.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-5.png","element":"img","alt":" x̃d","inline":true,"padRight":true},{"text":"and outputs the parameters ","element":"span"},{"style":{"height":11.1},"width":42.39,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-6.png","element":"img","alt":" η̃d","inline":true},{"text":". Then, in ","element":"span"},{"style":{"height":12.3},"width":153.04,"height":30.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-7.png","element":"img","alt":" η̃d ↦ ηd","inline":true},{"text":", the parameters are scaled back to their original scale, using the relationship between natural parameters described in Proposition ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"of the main paper. 
We perform the transformation ","element":"span"},{"style":{"height":12.3},"width":151.17,"height":30.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-8.png","element":"img","alt":" ηd ↦ ηd","inline":true,"padRight":true},{"text":"as described in Section ","element":"span"},{"href":"#id-40","text":"4.1 ","element":"a"},{"text":"of the main paper, that is, removing noise, clipping, and gathering the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"K ","element":"span"},{"text":"independent parameters into a dependent one as necessary. Finally, we can use those parameters ","element":"span"},{"style":{"height":11.1},"width":42.38,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-9.png","element":"img","alt":" ηd","inline":true,"padRight":true},{"text":"to evaluate the data coming from the same source as the original data.","element":"span"}],[{"text":"One point not discussed in the main paper is the choice of the Gamma distribution as a proxy to learn the parameters of the Bernoulli and Poisson distributions. Counter-intuitive as it might seem at first, the Gamma distribution works well for mean matching with respect to these distributions. 
To check this statement, we ran a short Python script using ","element":"span"},{"style":{"fontFamily":"monospace"},"text":"scipy.stats ","element":"span"},{"text":"that: i) generates random samples from a Bernoulli (Poisson) distribution; ii) adds noise drawn from a distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Beta","element":"span"},{"text":"(1","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"30)","element":"span"},{"text":"; iii) fits a Gamma distribution to the data and performs mean matching as explained before; and iv) computes the mean absolute difference between the estimated and true parameters. This procedure was performed for Bernoulli distributions with parameter ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"= ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i/","element":"span"},{"text":"50","element":"span"},{"text":", and Poisson distributions with parameter ","element":"span"},{"style":{"height":10.8},"width":97.54,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-10.png","element":"img","alt":" λ = i","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":157.12,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-11.png","element":"img","alt":" λ = i/50","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 0","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , ","element":"span"},{"text":"50","element":"span"},{"text":". 
The average error obtained was ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"0081 ","element":"span"},{"text":"and ","element":"span"},{"text":"0","element":"span"},{"style":{"fontStyle":"italic"},"text":".","element":"span"},{"text":"0712 ","element":"span"},{"text":"for the Bernoulli and Poisson distributions, respectively.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"A.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Illustrative example of data workflow","element":"span"}],[{"text":"We provide a simple example that shows how data is transformed and used throughout the entire process. Assume that we have two input dimensions, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"D ","element":"span"},{"text":"= 2","element":"span"},{"text":", whose distributions are assumed to be normal ","element":"span"},{"style":{"height":16.4},"width":260.67,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-12.png","element":"img","alt":" X1 ∼ N(µ, σ)","inline":true,"padRight":true},{"text":"and categorical with 3 classes ","element":"span"},{"style":{"height":16},"width":477.2,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-13.png","element":"img","alt":" X2 ∼ Cat(π = (π1, π2, π3))","inline":true},{"text":", respectively. Let us further suppose that we want to use ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma","element":"span"},{"text":", that is, Lipschitz-standardization combined with the Gamma trick. 
Then, we would not alter the first variable ","element":"span"},{"style":{"height":16.4},"width":364.42,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-14.png","element":"img","alt":"X1 = X1 ∼ N(µ, σ)","inline":true},{"text":", but substitute ","element":"span"},{"style":{"height":13.19},"width":49.02,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-15.png","element":"img","alt":" X2","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":18.3},"width":503.54,"height":45.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-16.png","element":"img","alt":" X2j = X2j + εj ∼ Γ(αj, βj)","inline":true},{"text":", where ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3 ","element":"span"},{"text":"indexes the new variables, ","element":"span"},{"style":{"height":16.79},"width":282.78,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-17.png","element":"img","alt":" X2j ∼ Bern(pj)","inline":true,"padRight":true},{"text":"refers to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"text":"-th element of ","element":"span"},{"style":{"height":13.19},"width":49.01,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-18.png","element":"img","alt":" X2","inline":true,"padRight":true},{"text":"in its one-hot representation, and ","element":"span"},{"style":{"height":16.79},"width":314.72,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-19.png","element":"img","alt":"εj ∼ Beta(1.1, 30)","inline":true,"padRight":true},{"text":"is the (independent) additive noise 
variable.","element":"span"}],[{"text":"Now, we can scale all variables, thus obtaining the new scaled variables ","element":"span"},{"style":{"height":16.4},"width":389.85,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-20.png","element":"img","alt":"X̃1 = ω1X1 ∼ N(µ̃, σ̃)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":15.59},"width":107.76,"height":38.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-21.png","element":"img","alt":"X̃2j =","inline":true},{"style":{"height":16.79},"width":304.28,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-22.png","element":"img","alt":"ω2jX2j ∼ Γ(α̃, β̃)","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"3","element":"span"},{"text":". After training, or whenever we need to evaluate the model on non-training data, we must return to the original probabilistic model ","element":"span"},{"style":{"height":14},"width":117.62,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-23.png","element":"img","alt":" X1, X2","inline":true},{"text":". 
When recovering the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"X ","element":"span"},{"text":"variables, we need to use Proposition ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"so that ","element":"span"},{"style":{"height":16},"width":266.89,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-24.png","element":"img","alt":" ηi = fi(ω) ⊙ �ηi","inline":true},{"text":", where we have obtained ","element":"span"},{"style":{"height":11.1},"width":36.38,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-25.png","element":"img","alt":" �ηi","inline":true,"padRight":true},{"text":"as the output of our model.","element":"span"}],[{"text":"To finally recover the original variables, ","element":"span"},{"style":{"height":14},"width":117.62,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-26.png","element":"img","alt":" X1, X2","inline":true},{"text":", we do not need to do anything to ","element":"span"},{"style":{"height":13.19},"width":49.01,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-27.png","element":"img","alt":" X1","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":13.19},"width":156.17,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-28.png","element":"img","alt":" X1 = X1","inline":true},{"text":". 
For the second variable, we obtain ","element":"span"},{"style":{"height":16.79},"width":280.95,"height":41.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-29.png","element":"img","alt":" X2j ∼ Bern(pj)","inline":true,"padRight":true},{"text":"as","element":"span"}],[{"style":{"width":"65%"},"width":1232,"height":51,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/10-30.png","element":"img"}],[{"style":{"width":"67%"},"width":1264,"height":73,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-0.png","element":"img"}]]},{"heading":"B Basic properties of L-smoothness","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"Proposition B.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If a real-valued function ","element":"span"},{"style":{"height":16},"width":75.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-1.png","element":"img","alt":" ℓ(η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-2.png","element":"img","alt":" Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-3.png","element":"img","alt":" ηi","inline":true},{"style":{"fontStyle":"italic"},"text":", the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i","element":"span"},{"style":{"fontStyle":"italic"},"text":"-th parameter of ","element":"span"},{"style":{"height":16.98},"width":118.42,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-4.png","element":"img","alt":" η ∈ RI","inline":true},{"style":{"fontStyle":"italic"},"text":", for all 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , I","element":"span"},{"style":{"fontStyle":"italic"},"text":", then ","element":"span"},{"style":{"height":0},"width":20,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-5.png","element":"img","alt":" ℓ","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":16.74},"width":100.1,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-6.png","element":"img","alt":"�i Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-7.png","element":"img","alt":" η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"(assuming the ","element":"span"},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"-norm).","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof. ","element":"span"},{"text":"Consider two arbitrary ","element":"span"},{"style":{"height":16.59},"width":157.96,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-8.png","element":"img","alt":" a, b ∈ RI","inline":true},{"text":". 
Then, by assumption, ","element":"span"},{"style":{"height":18.34},"width":545.93,"height":45.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-9.png","element":"img","alt":" |∂ηiℓ(a)−∂ηiℓ(b)| ≤ Li||a − b||","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , I ","element":"span"},{"text":"and","element":"span"}],[{"style":{"width":"79%"},"width":1484,"height":137,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-10.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proposition B.2. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"If two real-valued functions ","element":"span"},{"style":{"height":16},"width":91.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-11.png","element":"img","alt":" ℓ1(η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16},"width":91.36,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-12.png","element":"img","alt":" ℓ2(η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"are ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-13.png","element":"img","alt":" L1","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth and ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-14.png","element":"img","alt":" L2","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to 
","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-15.png","element":"img","alt":" η","inline":true},{"style":{"fontStyle":"italic"},"text":", respectively, then ","element":"span"},{"style":{"height":11.59},"width":115.79,"height":28.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-16.png","element":"img","alt":" ℓ1 + ℓ2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":136.82,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-17.png","element":"img","alt":" L1 + L2","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth with respect to ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-18.png","element":"img","alt":" η","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof. ","element":"span"},{"text":"Consider two arbitrary ","element":"span"},{"style":{"height":16.58},"width":157.96,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-19.png","element":"img","alt":" a, b ∈ RI","inline":true},{"text":". 
Then,","element":"span"}],[{"style":{"width":"89%"},"width":1671,"height":223,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-20.png","element":"img"}]]},{"heading":"C Exponential family","paragraphs":[[{"text":"As stated in the main paper, the exponential family is characterized by having the form","element":"span"}],[{"style":{"width":"72%"},"width":1364,"height":50,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-21.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":11.1},"width":62.08,"height":27.74,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-22.png","element":"img","alt":" ηnd","inline":true,"padRight":true},{"text":"are the natural parameters, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"T","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"the sufficient statistics, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"the base measure, and ","element":"span"},{"style":{"height":16},"width":86.77,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-23.png","element":"img","alt":" A(η)","inline":true,"padRight":true},{"text":"the log-partition function.","element":"span"}],[{"text":"To ease the task of transforming between natural (","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-24.png","element":"img","alt":"η","inline":true},{"text":") and usual (","element":"span"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-25.png","element":"img","alt":"θ","inline":true},{"text":") parameters, we provide 
in Table ","element":"span"},{"href":"#id-41","text":"2 ","element":"a"},{"text":"a cheat sheet relating them for the distributions used throughout the paper, as well as how the natural parameters scale with respect to the scaling factor ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-26.png","element":"img","alt":" ω","inline":true},{"text":".","element":"span"}],[{"text":"Regarding the relation between scaled and original data in the exponential family, we now prove a more general version of Proposition ","element":"span"},{"href":"#id-25","text":"3.1 ","element":"a"},{"text":"from the main text.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition C.1. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":117.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-27.png","element":"img","alt":" p(x; η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"be a density function of the exponential family where ","element":"span"},{"style":{"height":11.6},"width":203.14,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-28.png","element":"img","alt":" x ∈ X ⊂ R","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":16.98},"width":215.89,"height":42.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-29.png","element":"img","alt":" η ∈ Q ⊂ RI","inline":true},{"style":{"fontStyle":"italic"},"text":". 
Assume a bijective scaling function ","element":"span"},{"style":{"height":13.38},"width":298.88,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-30.png","element":"img","alt":" �x : X × R+ → X","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that for any ","element":"span"},{"style":{"height":13.78},"width":131.14,"height":34.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-31.png","element":"img","alt":" ω ∈ R+","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"it defines the function (and random variable) ","element":"span"},{"style":{"height":16},"width":221.98,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-32.png","element":"img","alt":" �xω = �x(x, ω)","inline":true},{"style":{"fontStyle":"italic"},"text":". If all sufficient statistics factorize as ","element":"span"},{"style":{"height":16},"width":488.38,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-33.png","element":"img","alt":" Ti(�xω) = fi(ω)Ti(x) + gi(ω)","inline":true},{"style":{"fontStyle":"italic"},"text":", then by defining ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-34.png","element":"img","alt":" �η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"such that ","element":"span"},{"style":{"height":16},"width":235.5,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-35.png","element":"img","alt":" η = f(ω) ⊙ �η","inline":true},{"style":{"fontStyle":"italic"},"text":", where ","element":"span"},{"style":{"height":16},"width":330.51,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-36.png","element":"img","alt":" f = (f1, f2, . . . 
, fI)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":10.4},"width":31,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-37.png","element":"img","alt":" ⊙","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is the element-wise multiplication, we have","element":"span"}],[{"style":{"width":"77%"},"width":1451,"height":54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-38.png","element":"img"}],[{"style":{"fontStyle":"italic"},"text":"where ","element":"span"},{"style":{"height":22.38},"width":48.58,"height":55.95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-39.png","element":"img","alt":" ∂j�ηi","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"denotes the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j","element":"span"},{"style":{"fontStyle":"italic"},"text":"th-partial derivative with respect to ","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-40.png","element":"img","alt":" �ηi","inline":true},{"style":{"fontStyle":"italic"},"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proof. 
","element":"span"},{"text":"First we are going to relate the normalization constants ","element":"span"},{"style":{"height":16},"width":82.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-41.png","element":"img","alt":" A(�η)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":82.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-42.png","element":"img","alt":" A(η)","inline":true,"padRight":true},{"text":"of ","element":"span"},{"style":{"height":16},"width":194.44,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-43.png","element":"img","alt":" log p(�xω; �η)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":171.37,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-44.png","element":"img","alt":" log p(x; η)","inline":true},{"text":", respectively:","element":"span"}],[{"style":{"width":"87%"},"width":1636,"height":310,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/11-45.png","element":"img"}],[{"id":"id-41","text":"Table 2: Relationship between parameters ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":22,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-0.png","element":"img","alt":" θ","inline":true,"padRight":true},{"text":"and natural parameters ","element":"figcaption","subtype":"caption"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-1.png","element":"img","alt":" η","inline":true},{"text":", as well as the way the latter scale (see Proposition ","element":"figcaption","subtype":"caption"},{"href":"#id-25","text":"3.1 ","element":"a","subtype":"caption"},{"text":"of the main text) for different distributions of the exponential 
family.","element":"figcaption","subtype":"caption"}],[{"style":{"width":"85%"},"width":1606,"height":1179,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-2.png","element":"img"}],[{"text":"We can safely divide by ","element":"span"},{"style":{"fontStyle":"italic"},"text":"h","element":"span"},{"text":"(","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"since it is the Radon-Nikodym derivative ","element":"span"},{"style":{"height":21.63},"width":88.36,"height":54.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-3.png","element":"img","alt":"dH(x)/dx","inline":true,"padRight":true},{"text":"and we can assume that it is non-zero almost everywhere in the domain of the likelihood.","element":"span"}],[{"style":{"width":"99%"},"width":1869,"height":187,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-4.png","element":"img"}],[{"text":"Denoting by ","element":"span"},{"style":{"height":16},"width":124.29,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-5.png","element":"img","alt":" ϕ(x, ω)","inline":true,"padRight":true},{"text":"everything that is not ","element":"span"},{"style":{"height":16},"width":113.25,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-6.png","element":"img","alt":" p(x; η)","inline":true,"padRight":true},{"text":"in the previous equation, we have that:","element":"span"}],[{"style":{"width":"66%"},"width":1254,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-7.png","element":"img"}],[{"text":"Now, for the case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j ","element":"span"},{"text":"= 1 ","element":"span"},{"text":"we just have to use the chain rule and the fact that 
","element":"span"},{"style":{"height":16},"width":124.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-8.png","element":"img","alt":" ϕ(x, ω)","inline":true,"padRight":true},{"text":"does not depend on ","element":"span"},{"style":{"height":10.4},"width":30.79,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-9.png","element":"img","alt":" ηi","inline":true},{"text":":","element":"span"}],[{"style":{"width":"90%"},"width":1703,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-10.png","element":"img"}],[{"text":"And we can just prove the case ","element":"span"},{"style":{"fontStyle":"italic"},"text":"j > ","element":"span"},{"text":"1 ","element":"span"},{"text":"by induction:","element":"span"}],[{"style":{"width":"94%"},"width":1775,"height":128,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-11.png","element":"img"}]]},{"heading":"D Finding optimal scaling factors for common distributions","paragraphs":[[{"text":"In this section we show some results on how to find the optimal scaling factor ","element":"span"},{"style":{"height":9.19},"width":41.81,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-12.png","element":"img","alt":" ωd","inline":true,"padRight":true},{"text":"solving the problem described in Equation ","element":"span"},{"href":"#id-29","text":"14 ","element":"a"},{"text":"of the main paper. 
For completeness, let us recall the problem:","element":"span"}],[{"id":"id-42","style":{"width":"84%"},"width":1587,"height":133,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/12-13.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":13.19},"width":44.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-0.png","element":"img","alt":"�Ld","inline":true,"padRight":true},{"text":"is the Lipschitz constant corresponding to the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness of the scaled ","element":"span"},{"style":{"fontStyle":"italic"},"text":"d","element":"span"},{"text":"-th dimension, and ","element":"span"},{"style":{"height":11.79},"width":118.58,"height":29.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-1.png","element":"img","alt":" L∗ > 0","inline":true,"padRight":true},{"text":"is the smoothness goal that we attempt to achieve (as described in the main text).","element":"span"}],[{"text":"For common distributions we are able to give some guarantees. 
Specifically, we can obtain closed-form solutions for the exponential and Gamma distributions, whereas for the (log-)normal distribution we prove the existence and uniqueness of the optimal ","element":"span"},{"style":{"height":9.19},"width":41.8,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-2.png","element":"img","alt":" ωd","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Remark ","element":"span"},{"text":"Throughout the proofs, we use the well-known result that ","element":"span"},{"style":{"height":18.34},"width":393.55,"height":45.85,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-3.png","element":"img","alt":" ∂ηiA(η) = E [Ti(x)]","inline":true,"padRight":true},{"text":"for any ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":", . . . , I ","element":"span"},{"text":"in the case of the exponential family. ","element":"span"},{"text":"Therefore, ","element":"span"},{"style":{"height":19.18},"width":479.78,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-4.png","element":"img","alt":" Li = Σj ∂ηj∂ηi log p(x; η)","inline":true,"padRight":true},{"text":"can be rewritten as ","element":"span"},{"style":{"height":19.18},"width":736.95,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-5.png","element":"img","alt":"Li = Σj ∂ηj Eη [Ti(x)] = Σj ∂ηi Eη [Tj(x)]","inline":true},{"text":", where the last equality is a direct consequence of Young’s theorem.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Proposition D.1 ","element":"span"},{"text":"(Exponential distribution)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":218.56,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-6.png","element":"img","alt":" X ∼ Exp(λ)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20.4},"width":238.46,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-7.png","element":"img","alt":" X = {xn}Nn=1","inline":true},{"style":{"fontStyle":"italic"},"text":". Suppose that, for some value ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-8.png","element":"img","alt":" �η","inline":true},{"style":{"fontStyle":"italic"},"text":", it ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds that ","element":"span"},{"style":{"height":16},"width":188.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-9.png","element":"img","alt":" log p(X; �η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-10.png","element":"img","alt":" Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth w.r.t. ","element":"span"},{"style":{"height":12.4},"width":106.06,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-11.png","element":"img","alt":" ηi ∈ η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then the solution for problem ","element":"span"},{"href":"#id-42","style":{"fontStyle":"italic"},"text":"23 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"always exists, is unique, and can be written as","element":"span"}],[{"style":{"width":"55%"},"width":1036,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-12.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof. ","element":"span"},{"text":"The minimum of problem ","element":"span"},{"href":"#id-42","text":"23 ","element":"a"},{"text":"is attained when ","element":"span"},{"style":{"height":20.4},"width":240.79,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-13.png","element":"img","alt":"Σ_{i=1}^{I} L̃i = L∗","inline":true},{"text":". In this particular case, this means ","element":"span"},{"style":{"height":13.37},"width":143.08,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-14.png","element":"img","alt":" L̃1 = L∗","inline":true},{"text":". As shown ","element":"span"},{"text":"in Equation ","element":"span"},{"href":"#id-43","text":"10 ","element":"a"},{"text":"from the main paper, we know that ","element":"span"},{"style":{"height":19.18},"width":491.13,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-15.png","element":"img","alt":"L̃i(ω) = |fi(ω)| Σj |fj(ω)| Li","inline":true,"padRight":true},{"text":"for the ","element":"span"},{"text":"1","element":"span"},{"text":"-norm. 
In our particular case:","element":"span"}],[{"style":{"width":"87%"},"width":1635,"height":131,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-16.png","element":"img"}],[{"text":"To show that ","element":"span"},{"style":{"height":10.98},"width":42.24,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-17.png","element":"img","alt":" ω∗","inline":true,"padRight":true},{"text":"always exists, we only have to show that ","element":"span"},{"style":{"height":13.19},"width":118.13,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-18.png","element":"img","alt":" L1 > 0","inline":true,"padRight":true},{"text":"in all cases, which can easily be shown:","element":"span"}],[{"style":{"width":"51%"},"width":972,"height":162,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-19.png","element":"img"}],[{"text":"and ","element":"span"},{"style":{"height":20.76},"width":365.66,"height":51.9,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-20.png","element":"img","alt":" L1 = |∂2η1| = η−21 > 0","inline":true,"padRight":true},{"text":"since ","element":"span"},{"style":{"height":14.4},"width":110.8,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-21.png","element":"img","alt":" η1 > 0","inline":true,"padRight":true},{"text":"by definition.","element":"span"}],[{"style":{"width":"6%"},"width":116,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-22.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proposition D.2 ","element":"span"},{"text":"(Gamma distribution)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":16},"width":224.53,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-23.png","element":"img","alt":" X ∼ Γ(α, β)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20.4},"width":249.32,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-24.png","element":"img","alt":" X = {xn}Nn=1","inline":true},{"style":{"fontStyle":"italic"},"text":". Suppose that, for some value ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-25.png","element":"img","alt":" �η","inline":true},{"style":{"fontStyle":"italic"},"text":", it ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds that ","element":"span"},{"style":{"height":16},"width":188.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-26.png","element":"img","alt":" log p(X; �η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-27.png","element":"img","alt":" Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth w.r.t. ","element":"span"},{"style":{"height":12.4},"width":108.85,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-28.png","element":"img","alt":" ηi ∈ η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then the solution for problem ","element":"span"},{"href":"#id-42","style":{"fontStyle":"italic"},"text":"23 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"exists if ","element":"span"},{"style":{"height":13.38},"width":144.78,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-29.png","element":"img","alt":" L∗ > L1","inline":true},{"style":{"fontStyle":"italic"},"text":", is unique, and can be written as","element":"span"}],[{"style":{"width":"69%"},"width":1295,"height":148,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-30.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof. ","element":"span"},{"text":"As in the exponential case, we want to solve the equation ","element":"span"},{"style":{"height":16},"width":362.74,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-31.png","element":"img","alt":"�L1(ω) + �L2(ω) = L∗.","inline":true}],[{"style":{"width":"74%"},"width":1394,"height":113,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-32.png","element":"img"}],[{"text":"Therefore we need to find the roots of the polynomial ","element":"span"},{"style":{"height":17.38},"width":604.78,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-33.png","element":"img","alt":" L2ω2 + (L1 + L2)ω + L1 − L∗ = 0","inline":true},{"text":". To find the roots, let us denote the discriminant as ","element":"span"},{"style":{"height":17.38},"width":558.19,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-34.png","element":"img","alt":" ∆ = (L1 + L2)2 − 4L2(L1 − L∗)","inline":true},{"text":". 
Note that we can simplify ","element":"span"},{"style":{"height":11.6},"width":33,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-35.png","element":"img","alt":" ∆","inline":true},{"text":":","element":"span"}],[{"style":{"width":"65%"},"width":1221,"height":109,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-36.png","element":"img"}],[{"text":"The roots ","element":"span"},{"style":{"height":6.8},"width":25,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-37.png","element":"img","alt":" ω","inline":true,"padRight":true},{"text":"are given by","element":"span"}],[{"style":{"width":"83%"},"width":1569,"height":289,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/13-38.png","element":"img"}],[{"text":"If ","element":"span"},{"style":{"height":13.38},"width":141.69,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-0.png","element":"img","alt":" L∗ > L1","inline":true,"padRight":true},{"text":"we can again show that the solution always exists by computing ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-1.png","element":"img","alt":" L2","inline":true},{"text":":","element":"span"}],[{"style":{"width":"71%"},"width":1346,"height":440,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-2.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proposition D.3 ","element":"span"},{"text":"(Normal distribution)","element":"span"},{"style":{"fontWeight":"bold"},"text":". 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"Let ","element":"span"},{"style":{"height":17.38},"width":250.07,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-3.png","element":"img","alt":" X ∼ N(µ, σ2)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"and ","element":"span"},{"style":{"height":20.4},"width":245.38,"height":50.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-4.png","element":"img","alt":" X = {xn}Nn=1","inline":true},{"style":{"fontStyle":"italic"},"text":". Suppose that, for some value ","element":"span"},{"style":{"height":10.8},"width":24,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-5.png","element":"img","alt":" �η","inline":true},{"style":{"fontStyle":"italic"},"text":", it ","element":"span"},{"style":{"fontStyle":"italic"},"text":"holds that ","element":"span"},{"style":{"height":16},"width":188.9,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-6.png","element":"img","alt":" log p(X; �η)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"is ","element":"span"},{"style":{"height":13.19},"width":38.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-7.png","element":"img","alt":" Li","inline":true},{"style":{"fontStyle":"italic"},"text":"-smooth w.r.t. ","element":"span"},{"style":{"height":12.4},"width":117.75,"height":31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-8.png","element":"img","alt":" ηi ∈ η","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"i ","element":"span"},{"text":"= 1","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"text":"2","element":"span"},{"style":{"fontStyle":"italic"},"text":". 
Then the solution for problem ","element":"span"},{"href":"#id-42","style":{"fontStyle":"italic"},"text":"23 ","element":"a"},{"style":{"fontStyle":"italic"},"text":"always exists, is unique, and can be expressed as the unique positive root of","element":"span"}],[{"style":{"width":"69%"},"width":1301,"height":46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-9.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Proof. ","element":"span"},{"text":"First, note that ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-10.png","element":"img","alt":" L2","inline":true,"padRight":true},{"text":"is always positive. To show this, we compute its approximation once again:","element":"span"}],[{"style":{"width":"64%"},"width":1203,"height":391,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-11.png","element":"img"}],[{"text":"We have that ","element":"span"},{"style":{"height":13.19},"width":120.05,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-12.png","element":"img","alt":" L2 > 0","inline":true,"padRight":true},{"text":"since the second term is zero only when ","element":"span"},{"style":{"height":14},"width":99.07,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-13.png","element":"img","alt":" µ = 0","inline":true,"padRight":true},{"text":"and, if that is the case, ","element":"span"},{"style":{"height":14.4},"width":112.73,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-14.png","element":"img","alt":" η1 = 0","inline":true,"padRight":true},{"text":"and the first term is positive.","element":"span"}],[{"style":{"width":"77%"},"width":1454,"height":122,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-15.png","element":"img"}],[{"text":"This is equivalent to finding the positive roots of 
","element":"span"},{"style":{"height":17.39},"width":761.59,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-16.png","element":"img","alt":" Q(ω) = L2ω4 + (L1 + L2)ω3 + L1ω2 − L∗","inline":true},{"text":". Then let us call ","element":"span"},{"style":{"height":17.39},"width":366.53,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-17.png","element":"img","alt":"P(ω) = L2ω4 + L1ω2","inline":true,"padRight":true},{"text":"so that ","element":"span"},{"style":{"height":17.39},"width":584.56,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-18.png","element":"img","alt":" Q(ω) = P(ω) + (L1 + L2)ω3 − L∗","inline":true},{"text":".","element":"span"}],[{"text":"Note that there exists a unique positive solution of the equation ","element":"span"},{"style":{"height":16},"width":191,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-19.png","element":"img","alt":" P(ω) = Gi","inline":true,"padRight":true},{"text":"with ","element":"span"},{"style":{"height":13.19},"width":124.92,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-20.png","element":"img","alt":" Gi > 0","inline":true},{"text":". In fact, the only positive root of ","element":"span"},{"style":{"height":15.78},"width":317.96,"height":39.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-21.png","element":"img","alt":" L2ω4 + L1ω2 − Gi","inline":true,"padRight":true},{"text":"is","element":"span"}],[{"id":"id-44","style":{"width":"65%"},"width":1234,"height":102,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-22.png","element":"img"}],[{"text":"Define ","element":"span"},{"style":{"height":13.38},"width":145.46,"height":33.44,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-23.png","element":"img","alt":" G0 = L∗","inline":true},{"text":". 
As just pointed out, there exists a unique ","element":"span"},{"style":{"height":13.19},"width":115.82,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-24.png","element":"img","alt":" ω1 > 0","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":205.26,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-25.png","element":"img","alt":" P(ω1) = G0","inline":true},{"text":". Then","element":"span"}],[{"style":{"width":"91%"},"width":1710,"height":209,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-26.png","element":"img"}],[{"text":"since ","element":"span"},{"style":{"height":13.19},"width":149.68,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-27.png","element":"img","alt":" G1 < G0","inline":true},{"text":", the discriminant of Equation ","element":"span"},{"href":"#id-44","text":"27 ","element":"a"},{"text":"is smaller in the case of ","element":"span"},{"style":{"height":13.19},"width":47.33,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-28.png","element":"img","alt":" G1","inline":true,"padRight":true},{"text":"and thus ","element":"span"},{"style":{"height":11.19},"width":136.62,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-29.png","element":"img","alt":" ω2 < ω1","inline":true},{"text":".","element":"span"}],[{"style":{"width":"84%"},"width":1577,"height":180,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-30.png","element":"img"}],[{"text":"We can now find ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ω","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-31.png","element":"img","alt":"2","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"< 
ω","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-32.png","element":"img","alt":"3","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"< ω","element":"span"},{"style":{"height":7.6},"width":16,"height":19,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-33.png","element":"img","alt":"1","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"P","element":"span"},{"style":{"height":16},"width":184.6,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-34.png","element":"img","alt":"(ω3) = G2","inline":true},{"text":", ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":17.38},"width":483.76,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-35.png","element":"img","alt":"(ω3) = (L1 + L2)(ω33 − ω32)","inline":true},{"text":". 
Note that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"ω","element":"span"},{"style":{"height":17.34},"width":17.43,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-36.png","element":"img","alt":"31","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"> ω","element":"span"},{"style":{"height":17.34},"width":75.6,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-37.png","element":"img","alt":"33 ⇒","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"ω","element":"span"},{"style":{"height":17.34},"width":110.25,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-38.png","element":"img","alt":"31 + ω32","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"> ω","element":"span"},{"style":{"height":17.34},"width":123.53,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-39.png","element":"img","alt":"33 ⇒ ω31","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"> ω","element":"span"},{"style":{"height":17.34},"width":110.25,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-40.png","element":"img","alt":"33 − ω32","inline":true},{"text":", meaning that ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q","element":"span"},{"style":{"height":16},"width":74.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-41.png","element":"img","alt":"(ω3)","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"< Q","element":"span"},{"style":{"height":16},"width":74.18,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/14-42.png","element":"img","alt":"(ω1)","inline":true},{"text":".","element":"span"}],[{"text":"Thus far, we have built a sequence such that 
","element":"span"},{"style":{"height":16},"width":527.58,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-0.png","element":"img","alt":" Q(ω2) < 0 < Q(ω3) < Q(ω1)","inline":true},{"text":". If we follow the process and define ","element":"span"},{"style":{"height":17.39},"width":538.38,"height":43.47,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-1.png","element":"img","alt":"G3 = G2 − (L1 + L2)(ω33 − ω32)","inline":true,"padRight":true},{"text":"we will find an ","element":"span"},{"style":{"height":11.19},"width":232.44,"height":27.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-2.png","element":"img","alt":" ω2 < ω4 < ω3","inline":true,"padRight":true},{"text":"such that ","element":"span"},{"style":{"height":16},"width":653.68,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-3.png","element":"img","alt":" Q(ω2) < Q(ω4) < 0 < Q(ω3) < Q(ω1)","inline":true},{"text":".","element":"span"}],[{"text":"Finally, let us define the sequence of intervals ","element":"span"},{"style":{"height":16},"width":383.48,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-4.png","element":"img","alt":" Ii = [Q(ωi+1), Q(ωi)]","inline":true,"padRight":true},{"text":"for ","element":"span"},{"style":{"height":14},"width":271.27,"height":35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-5.png","element":"img","alt":" i = 1, 2, . . . , ∞","inline":true,"padRight":true},{"text":"constructed using the described procedure. This sequence is a strictly decreasing nested sequence of non-empty compact subsets of ","element":"span"},{"text":"R","element":"span"},{"text":". 
Therefore, Cantor’s intersection theorem states that the intersection of these intervals is non-empty, ","element":"span"},{"style":{"height":16.4},"width":156.16,"height":41,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-6.png","element":"img","alt":" ∩iIi ̸= ∅","inline":true},{"text":", and since the only element which is in all the intervals is ","element":"span"},{"style":{"height":16},"width":223.46,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-7.png","element":"img","alt":" 0, ∩iIi = {0}","inline":true},{"text":".","element":"span"}],[{"text":"The sequence ","element":"span"},{"style":{"height":18},"width":208.05,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-8.png","element":"img","alt":" {Q(ω2i)}∞i=1","inline":true,"padRight":true},{"text":"(","element":"span"},{"style":{"height":18},"width":248.39,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-9.png","element":"img","alt":"{Q(ω2i+1)}∞i=1","inline":true},{"text":") converges to ","element":"span"},{"text":"0 ","element":"span"},{"text":"since it is a strictly decreasing (increasing) sequence ","element":"span"},{"text":"lower-bounded (upper-bounded) by ","element":"span"},{"text":"0","element":"span"},{"text":". 
The sequences of their preimages, ","element":"span"},{"style":{"height":18},"width":145.55,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-10.png","element":"img","alt":" {ω2i}∞i=1","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":18},"width":185.9,"height":44.99,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-11.png","element":"img","alt":" {ω2i+1}∞i=1","inline":true},{"text":", then converge ","element":"span"},{"text":"to the same value, ","element":"span"},{"style":{"height":10.99},"width":42.24,"height":27.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-12.png","element":"img","alt":" ω∗","inline":true},{"text":", the root of ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Q ","element":"span"},{"text":"and the solution of problem ","element":"span"},{"href":"#id-42","text":"23.","element":"a"}],[{"style":{"width":"6%"},"width":116,"height":32,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-13.png","element":"img"}]]},{"heading":"E L-smoothness estimation","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"E.1 ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"style":{"fontWeight":"bold"},"text":"-smoothness after standardization","element":"span"}],[{"text":"As in Appendix ","element":"span"},{"text":"D, ","element":"span"},{"text":"here we compute the estimator of the local ","element":"span"},{"style":{"fontStyle":"italic"},"text":"L","element":"span"},{"text":"-smoothness for some common distributions using L ","element":"span"},{"style":{"height":16.74},"width":97.24,"height":41.84,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-14.png","element":"img","alt":" = Σi","inline":true,"padRight":true},
{"text":"L","element":"span"},{"style":{"height":19.18},"width":501.84,"height":47.96,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-15.png","element":"img","alt":"i = ΣiΣj|∂ηj∂ηi log p(x; η)|","inline":true},{"text":", and then see how this smoothness changes ","element":"span"},{"text":"as we scale by ","element":"span"},{"style":{"height":16},"width":173.7,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-16.png","element":"img","alt":" ω = 1/std","inline":true},{"text":". Here we use the standard deviation expression of each particular likelihood; therefore, these results hold as long as the selected likelihood properly fits the data.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"(Log-)Normal distribution ","element":"span"},{"text":"First, we compute the partial derivatives of the log-likelihood:","element":"span"}],[{"style":{"width":"77%"},"width":1448,"height":524,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-17.png","element":"img"}],[{"text":"Therefore, we have that ","element":"span"},{"style":{"height":17.39},"width":292.14,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-18.png","element":"img","alt":" L1 ≈ σ2 + 2|µ|σ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":432.96,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-19.png","element":"img","alt":" L2 ≈ 2σ2(|µ| + σ2 + 2µ2)","inline":true},{"text":". 
After standardizing the data, we have that ","element":"span"},{"style":{"height":16},"width":144.1,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-20.png","element":"img","alt":"�µ = µ/σ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":13.39},"width":115.23,"height":33.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-21.png","element":"img","alt":" �σ2 = 1","inline":true},{"text":", resulting in ","element":"span"},{"style":{"height":21.63},"width":256.1,"height":54.06,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-22.png","element":"img","alt":"�Lstd1 = 1 + 2 |µ|σ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":21.63},"width":399.1,"height":54.07,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-23.png","element":"img","alt":" �Lstd2 = 4| µσ|2 + 2 |µ|σ + 2","inline":true},{"text":".","element":"span"}],[{"id":"id-45","style":{"width":"99%"},"width":1869,"height":684,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-24.png","element":"img"}],[{"text":"So that ","element":"span"},{"style":{"height":18.18},"width":581.55,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-25.png","element":"img","alt":" L1 ≈ |1 + (1 − α)ψ(1)(α)| + 1/β","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":16},"width":351.96,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-26.png","element":"img","alt":" L2 ≈ V ar [x] + 1/β","inline":true},{"text":". 
After standardizing ","element":"span"},{"style":{"height":6.8},"width":117.96,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-27.png","element":"img","alt":" �α = α","inline":true},{"text":", ","element":"span"},{"style":{"height":16},"width":150.17,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-28.png","element":"img","alt":"�β = √α","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"V ar ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"] = 1","element":"span"},{"text":", therefore ","element":"span"},{"style":{"height":17.34},"width":70.2,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-29.png","element":"img","alt":"�Lstd1","inline":true,"padRight":true},{"text":"is a function of ","element":"span"},{"style":{"height":18.18},"width":127.31,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-30.png","element":"img","alt":" ψ(1)(α)","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":292.62,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/15-31.png","element":"img","alt":"�Lstd2 = 1 + 1/√α","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Exponential distribution ","element":"span"},{"text":"If ","element":"span"},{"style":{"height":16},"width":227.28,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-0.png","element":"img","alt":" X ∼ Exp(λ)","inline":true,"padRight":true},{"text":"then ","element":"span"},{"style":{"height":16},"width":255.13,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-1.png","element":"img","alt":" X ∼ Γ(1, 1/λ)","inline":true},{"text":", so we can use the previous results so that 
","element":"span"},{"style":{"height":13.19},"width":91.42,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-2.png","element":"img","alt":" L1 ≈","inline":true,"padRight":true},{"style":{"fontStyle":"italic"},"text":"V ar ","element":"span"},{"text":"[","element":"span"},{"style":{"fontStyle":"italic"},"text":"x","element":"span"},{"text":"] ","element":"span"},{"text":"and ","element":"span"},{"style":{"height":17.34},"width":144.94,"height":43.35,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-3.png","element":"img","alt":"�Lstd1 = 1","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Rayleigh distribution ","element":"span"},{"text":"This distribution has parameter ","element":"span"},{"style":{"height":11.6},"width":99.11,"height":29,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-4.png","element":"img","alt":" σ > 0","inline":true},{"text":", sufficient statistic ","element":"span"},{"style":{"height":17.39},"width":230.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-5.png","element":"img","alt":" T1(x) = x2/2","inline":true},{"text":", and natural parameter ","element":"span"},{"style":{"height":17.39},"width":201.85,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-6.png","element":"img","alt":" η1 = −1/σ2","inline":true},{"text":".","element":"span"}],[{"text":"We start by computing ","element":"span"},{"style":{"height":20.22},"width":541.48,"height":50.54,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-7.png","element":"img","alt":" ∂η1A(η) = E [T1(x)] = 12 E�x2�","inline":true},{"text":". 
Using that, for this distribution, ","element":"span"},{"style":{"height":19.81},"width":421.52,"height":49.52,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-8.png","element":"img","alt":" E�xj�= σj2j/2Γ(1 + j2)","inline":true},{"text":":","element":"span"}],[{"style":{"width":"99%"},"width":1870,"height":1030,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-9.png","element":"img"}],[{"text":"Therefore, ","element":"span"},{"style":{"height":17.39},"width":295.53,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-10.png","element":"img","alt":" L1 ≈ µ3/λ + µ/λ","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.39},"width":470.82,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-11.png","element":"img","alt":" L2 ≈ µ/λ + (2µ + λ)/(µλ2)","inline":true},{"text":". After standardizing we have that ","element":"span"},{"style":{"height":17.39},"width":304.07,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-12.png","element":"img","alt":" V ar [�x] = µ3/λ =","inline":true},{"style":{"height":16.58},"width":198.31,"height":41.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-13.png","element":"img","alt":"1 ⇒ λ = µ3","inline":true},{"text":", thus ","element":"span"},{"style":{"height":17.38},"width":273.42,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-14.png","element":"img","alt":"�Lstd1 = 1 + 1/µ2","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":17.38},"width":416.98,"height":43.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-15.png","element":"img","alt":"�Lstd2 = (2 + µ2 + µ4)/µ6","inline":true},{"text":".","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"Inverse Gamma distribution ","element":"span"},{"text":"This distribution has 
parameters ","element":"span"},{"style":{"height":14.4},"width":141.13,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-16.png","element":"img","alt":" α, β > 0","inline":true},{"text":", sufficient statistics ","element":"span"},{"style":{"height":16},"width":383.99,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-17.png","element":"img","alt":" T1(x) = log x, T2(x) =","inline":true,"padRight":true},{"text":"1","element":"span"},{"style":{"fontStyle":"italic"},"text":"/x","element":"span"},{"text":", and natural parameters ","element":"span"},{"style":{"height":14.8},"width":378.57,"height":37,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-18.png","element":"img","alt":" η1 = −α − 1, η2 = −β","inline":true},{"text":".","element":"span"}],[{"style":{"width":"85%"},"width":1605,"height":942,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/16-19.png","element":"img"}],[{"text":"The interesting bit about these last two estimators is that both explode as they get closer to ","element":"span"},{"text":"2","element":"span"},{"text":", and both vanish as they get further from it, as can be readily checked by plotting them.","element":"span"}],[{"style":{"fontWeight":"bold"},"text":"E.2 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Scale-invariant smoothness of the Gamma distribution","element":"span"}],[{"text":"In Section ","element":"span"},{"href":"#id-40","text":"4.1 ","element":"a"},{"text":"we introduced the Gamma trick, which acts as an approximation for discrete distributions. Moreover, the discrete variables were assumed to take values in the natural numbers. 
The reason is that it is beneficial for this approximation that the original variable ","element":"span"},{"style":{"fontStyle":"italic"},"text":"x ","element":"span"},{"text":"is somewhat far from zero.","element":"span"}],[{"text":"This statement is justified by the following: the second derivative of a Gamma log-likelihood with respect to the first natural parameter, ","element":"span"},{"style":{"height":19.72},"width":233.13,"height":49.31,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-0.png","element":"img","alt":" ∂2η1 log p(x; η)","inline":true},{"text":", rapidly decreases as the data moves away from zero.","element":"span"}],[{"text":"As computed before in Equation ","element":"span"},{"href":"#id-45","text":"28, ","element":"a"},{"text":"one part of ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-1.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"is scale-invariant and has the form ","element":"span"},{"style":{"height":18.19},"width":321.21,"height":45.46,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-2.png","element":"img","alt":" 1 + (1 − α)ψ(1)(α)","inline":true},{"text":". Figure ","element":"span"},{"href":"#id-46","text":"7 ","element":"a"},{"text":"shows a plot of this formula as a function of ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-3.png","element":"img","alt":" α","inline":true},{"text":". 
It is easy to observe that as the shape parameter grows the value of ","element":"span"},{"id":"id-46","text":"(our approximation to) ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-4.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"drastically decreases.","element":"span"}],[{"style":{"width":"80%"},"width":1498,"height":769,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-5.png","element":"img"}],[{"text":"Figure 7: Plot of ","element":"figcaption","subtype":"caption"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-6.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"for the Gamma distribution.","element":"figcaption","subtype":"caption"}],[{"text":"Finally, by supposing that discrete data are natural numbers, the mode is at least one, which in practice means that the value for ","element":"span"},{"style":{"height":6.8},"width":26,"height":17,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-7.png","element":"img","alt":" α","inline":true,"padRight":true},{"text":"is bigger than ","element":"span"},{"text":"1 ","element":"span"},{"text":"(usually close to ","element":"span"},{"text":"10","element":"span"},{"text":"), thus ensuring that the value of (our approximation to) ","element":"span"},{"style":{"height":13.19},"width":43.12,"height":32.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-8.png","element":"img","alt":" L1","inline":true,"padRight":true},{"text":"mostly depends on the scale-dependent parameter ","element":"span"},{"style":{"height":14.4},"width":23,"height":36,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-9.png","element":"img","alt":" β","inline":true},{"text":".","element":"span"}]]},{"heading":"F Details on the experimental 
setup","paragraphs":[[{"style":{"fontWeight":"bold"},"text":"F.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Missing imputation models","element":"span"}],[{"text":"Here we give a more detailed description of the models used in the experiments. All of them have the form described in the ","element":"span"},{"id":"id-47","text":"problem statement (Section ","element":"span"},{"text":"2)","element":"span"},{"text":", following the graphical model depicted in Figure ","element":"span"},{"href":"#id-47","text":"8.","element":"a"}],[{"style":{"width":"29%"},"width":547,"height":376,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/17-10.png","element":"img"}],[{"text":"Figure 8: Latent variable model describing the joint distribution of Section ","element":"figcaption","subtype":"caption"},{"text":"2.","element":"span","subtype":"caption"}],[{"style":{"fontWeight":"bold"},"text":"Mixture model ","element":"span"},{"text":"Following the form of the joint distribution from Section ","element":"span"},{"text":"2, ","element":"span"},{"text":"the mixture model is fully described by:","element":"span"}],[{"style":{"width":"100%"},"width":1872,"height":876,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/18-0.png","element":"img"}],[{"text":"So that the discrete latent parameters can be trained via automatic differentiation, the latent categorical distribution is implemented using a Gumbel-Softmax distribution ","element":"span"},{"href":"#id-48","referenceIndex":11,"text":"(Jang et al., ","element":"a"},{"href":"#id-48","referenceIndex":11,"text":"2016) ","element":"a"},{"text":"with a temperature that is updated every ","element":"span"},{"text":"20 ","element":"span"},{"text":"epochs as:","element":"span"}],[{"style":{"width":"29%"},"width":549,"height":43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/18-1.png","element":"img"}],[{"style":{"fontWeight":"bold"},"text":"Matrix 
factorization ","element":"span"},{"text":"Similar to the mixture model, the matrix factorization model follows the same graphical model and is (almost) fully described by:","element":"span"}],[{"style":{"width":"100%"},"width":1874,"height":1021,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/18-2.png","element":"img"}],[{"text":"General notes:","element":"span"}],[{"text":"• We assume normal latent variables with a standard normal as prior.","element":"span"}],[{"text":"• Hidden layers have 256 neurons.","element":"span"}],[{"text":"• The latent size is set to ","element":"span"},{"text":"75 % ","element":"span"},{"text":"of the number of data dimensions (before preprocessing).","element":"span"}],[{"text":"• Layers are initialized using a Xavier uniform policy.","element":"span"}],[{"text":"Specifics about the encoder:","element":"span"}],[{"text":"• As we have to avoid using the missing data (since it is going to be our test set), we implement an input-dropout layer as in ","element":"span"},{"href":"#id-31","referenceIndex":16,"text":"Nazabal et al. ","element":"a"},{"href":"#id-31","referenceIndex":16,"text":"(2018)","element":"a"},{"text":".","element":"span"}],[{"text":"• In order to guarantee a common input (and thus, a common well-behaved neural net) across all data scaling methods, we put a batch-normalization layer at the beginning of the encoder. 
Note that this does not interfere with the goal of this work, which is about the evaluation of the loss function.","element":"span"}],[{"text":"• In order to obtain the distributional parameters of ","element":"span"},{"style":{"height":10},"width":106.13,"height":25,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-0.png","element":"img","alt":" zn, µn","inline":true,"padRight":true},{"text":"and ","element":"span"},{"style":{"height":9.19},"width":42.77,"height":22.98,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-1.png","element":"img","alt":" σn","inline":true},{"text":", we pass the result of the encoder through two linear layers, one for the mean and another for the log-scale. The latter is transformed to the scale via a softplus function.","element":"span"}],[{"text":"Specifics about the decoder:","element":"span"}],[{"style":{"width":"100%"},"width":1872,"height":1218,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-2.png","element":"img"}],[{"text":"When it comes to evaluation we use missing imputation error, that is, for the imputed missing values that are numerical we compute the normalized root mean squared error (NRMSE),","element":"span"}],[{"style":{"width":"65%"},"width":1220,"height":95,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-3.png","element":"img"}],[{"text":"where ","element":"span"},{"style":{"height":10.8},"width":23,"height":27,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-4.png","element":"img","alt":" ˆx","inline":true,"padRight":true},{"text":"is the value inferred by the model, and in the case of nominal data we compute the error rate, i.e.,","element":"span"}],[{"style":{"width":"64%"},"width":1202,"height":116,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-5.png","element":"img"}],[{"text":"The final metric is the mean across dimensions, 
","element":"span"},{"style":{"height":19.37},"width":329.23,"height":48.43,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/19-6.png","element":"img","alt":" err = 1D�d err(d)","inline":true},{"text":".","element":"span"}],[{"id":"id-49","style":{"width":"100%"},"width":1873,"height":568,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/20-0.png","element":"img"}],[{"text":"Figure 9: Missing imputation error across different datasets and missing-values percentages. Lower is better.","element":"figcaption","subtype":"caption"}],[{"id":"id-35","style":{"width":"100%"},"width":1872,"height":602,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/20-1.png","element":"img"}],[{"text":"Figure 10: Per-dimension normalized missing imputation error on the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"figcaption","subtype":"caption"},{"text":"dataset (lower is better).","element":"figcaption","subtype":"caption"}],[{"id":"id-36","style":{"width":"100%"},"width":1873,"height":596,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/20-2.png","element":"img"}],[{"text":"Figure 11: Per-dimension normalized missing imputation error on the ","element":"figcaption","subtype":"caption"},{"style":{"fontStyle":"italic"},"text":"letter ","element":"figcaption","subtype":"caption"},{"text":"dataset (lower is better).","element":"figcaption","subtype":"caption"}]]},{"heading":"G Additional experimental results","paragraphs":[[{"text":"In this section we show complementary results from the experiments performed in the main paper. First, Figure ","element":"span"},{"href":"#id-49","text":"9 ","element":"a"},{"text":"depicts the same data as Figure ","element":"span"},{"href":"#id-34","text":"3 ","element":"a"},{"text":"of the main paper, but averaging across models instead of missing-values percentages. 
Second, we plot in Figures ","element":"span"},{"href":"#id-35","text":"10 ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-36","text":"11 ","element":"a"},{"text":"per-dimension barplots of the normalized missing imputation error as in Figure ","element":"span"},{"href":"#id-33","text":"5, ","element":"a"},{"text":"now for the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"letter ","element":"span"},{"text":"datasets, respectively. These figures further validate the argument of ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma ","element":"span"},{"text":"not overlooking any variable, unlike ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-bern ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-gamma","element":"span"},{"text":". Finally, we present the results in tabular form, divided by type of variable (discrete vs. continuous) and type of model (mixture model, matrix factorization and VAE). Tables ","element":"span"},{"href":"#id-37","text":"4, ","element":"a"},{"href":"#id-50","text":"5, ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-38","text":"6 ","element":"a"},{"text":"show the results obtained with a ","element":"span"},{"text":"10 %","element":"span"},{"text":", ","element":"span"},{"text":"20 %","element":"span"},{"text":", and ","element":"span"},{"text":"50 % ","element":"span"},{"text":"of missing values, respectively. Major differences have been colored to ease their reading.","element":"span"}],[{"text":"As discussed in Section ","element":"span"},{"text":"5, ","element":"span"},{"text":"applying Lipschitz standardization results in an improvement on the imputation error across all datasets, being in the worst case as good as the best of the other methods. 
We can also observe how this improvement mainly manifests on discrete random variables when the Bernoulli and Gamma tricks are applied, and that the effect of data scaling is less noticeable as the expressiveness of the model increases. There are cases, like in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult ","element":"span"},{"text":"dataset, where there is a trade-off between learning the discrete dimensions and worsening the results on the continuous dimensions. However, the case where properly learning the discrete distributions translates to an improvement on all dimensions can also occur, as in the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"span"},{"text":"dataset.","element":"span"}],[{"text":"Finally, there is an important aspect that qualitatively differentiates ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-gamma ","element":"span"},{"text":"from ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"lip-bern ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic","fontWeight":"bold"},"text":"std-gamma","element":"span"},{"text":". The consequence of Lipschitz standardizing every dimension is the more balanced learning that we aim for, and in cases with high heterogeneity, such as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"span"},{"text":"and ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Adult","element":"span"},{"text":", the stability and robustness of the algorithm increase. A clear example of this can be seen by checking the evolution of the ","element":"span"},{"style":{"fontStyle":"italic"},"text":"defaultCredit ","element":"span"},{"text":"dataset across Tables ","element":"span"},{"href":"#id-37","text":"4, ","element":"a"},{"href":"#id-50","text":"5, ","element":"a"},{"text":"and ","element":"span"},{"href":"#id-38","text":"6. 
","element":"a"},{"style":{"fontStyle":"italic"},"text":"It is worth-noting that ","element":"span"},{"style":{"fontWeight":"bold"},"text":"lip-gamma ","element":"span"},{"style":{"fontStyle":"italic"},"text":"keeps achieving consistent results even under a half missing-data regime, which is impressive.","element":"span"}],[{"text":"Table 4: Missing imputation error with a ","element":"figcaption","subtype":"caption"},{"text":"10 % ","element":"figcaption","subtype":"caption"},{"text":"of missing data.","element":"figcaption","subtype":"caption"}],[{"text":"Discrete ","element":"span"},{"text":"Continuous ","element":"span"},{"id":"id-37","style":{"fontStyle":"italic"},"text":"Imputation error ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE","element":"span"}],[{"style":{"width":"99%"},"width":1856,"height":1347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/22-0.png","element":"img"}],[{"text":"Table 5: Missing imputation error with a ","element":"figcaption","subtype":"caption"},{"text":"20 % ","element":"figcaption","subtype":"caption"},{"text":"of missing data.","element":"figcaption","subtype":"caption"}],[{"text":"Discrete ","element":"span"},{"text":"Continuous ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Imputation error ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. 
","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE","element":"span"}],[{"id":"id-50","style":{"width":"99%"},"width":1856,"height":1347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/23-0.png","element":"img"}],[{"text":"Table 6: Missing imputation error with a ","element":"figcaption","subtype":"caption"},{"text":"50 % ","element":"figcaption","subtype":"caption"},{"text":"of missing data.","element":"figcaption","subtype":"caption"}],[{"text":"Discrete ","element":"span"},{"text":"Continuous ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Imputation error ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Mixture ","element":"span"},{"style":{"fontStyle":"italic"},"text":"Matrix fact. ","element":"span"},{"style":{"fontStyle":"italic"},"text":"VAE","element":"span"}],[{"id":"id-38","style":{"width":"99%"},"width":1856,"height":1347,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/2002.11369/images/24-0.png","element":"img"}]]}],"_version":"3.3.2"},"paperNode":"$1b:props:children:props:children:0:props:product"}]]]}]}]