28:["$","$L31",null,{"isWhiteLabelled":false,"children":["$","$Lc",null,{"pt":{"compact":0,"expanded":3},"children":[["$","$L32",null,{"noStar":true,"publisher":true,"task":true,"params":true,"size":"xl","product":{"id":"eyJwYXBlcklEIjoiMTYxMC4wNTczNSIsInB1Ymxpc2hlciI6ImFyeGl2In0=","publisher":"arxiv","updated":"2016-10-18T18:35:09.000Z","paperID":"1610.05735","published":"2016-10-18T18:35:09.000Z","authors":"[\"Daniel Ritchie\",\"Paul Horsfall\",\"Noah D. Goodman\"]","title":"Deep Amortized Inference for Probabilistic Programs","scoreTrending":null,"summary":"$33","lastCheckedForCode":"2022-09-02T17:40:57.449Z","links":[{"id":"eyJ1cmwiOiJodHRwczovL3BhcGVyc3dpdGhjb2RlLmNvbS9wYXBlci9kZWVwLWFtb3J0aXplZC1pbmZlcmVuY2UtZm9yLXByb2JhYmlsaXN0aWMifQ==","type":"pwc","url":"https://paperswithcode.com/paper/deep-amortized-inference-for-probabilistic","data":null}],"reposConnection":{"edges":[]},"models":[],"tags":[{"id":"eyJuYW1lIjoicHJvYmFiaWxpc3RpYyBwcm9ncmFtbWluZyIsInR5cGUiOiJ0YXNrIn0=","name":"probabilistic programming","description":"Probabilistic programming involves inputting a model and data, and outputting a distribution over possible explanations. It's used in real-world scenarios like predicting stock market trends or diagnosing diseases, where uncertainty is inherent and probabilistic models can provide a range of possible outcomes.","scoreTrending":null,"count":{"stars":380,"papers":242,"models":391},"__typename":"Tag"}],"summaries":[],"emailsConnection":{"edges":[]},"__typename":"paper","authorArray":["Daniel Ritchie","Paul Horsfall","Noah D. Goodman"]}}],["$","$L25",null,{"container":true,"columns":100,"spacing":{"compact":0,"expanded":2,"large":3},"children":[["$","$L25",null,{"size":{"compact":100,"expanded":100,"large":68},"children":[["$","$8",null,{"children":["$","$L34",null,{"publisher":"arxiv","paperID":"1610.05735","product":{"paper":"$28:props:children:props:children:0:props:product","models":"$28:props:children:props:children:0:props:product:models"},"isWhiteLabelled":false}]}],["$","$8",null,{"children":["$","$L35",null,{"article":"$L36","model":"$undefined"}]}]]}],["$","$L25",null,{"size":"grow","children":["$","$L37",null,{}]}]]}],["$","$8",null,{"children":null}],[["$","audio",null,{"id":"tts"}],["$","$L38",null,{"paperID":"1610.05735","publisher":"arxiv","paperJSON":{"title":"Deep Amortized Inference for Probabilistic Programs","paperID":"1610.05735","avgLineHeight":10.9,"imgScale":4,"sections":[{"heading":"Abstract","paragraphs":[[{"text":"Probabilistic programming languages (PPLs) are a powerful modeling tool, able to represent any computable probability distribution. Unfortunately, probabilistic program inference is often intractable, and existing PPLs mostly rely on expensive, approximate sampling-based methods. To alleviate this problem, one could try to learn from past inferences, so that future inferences run faster. This strategy is known as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"amortized inference","element":"span"},{"text":"; it has recently been applied to Bayesian networks [","element":"span"},{"href":"#id-0","referenceIndex":28,"text":"28","element":"a"},{"text":", ","element":"span"},{"href":"#id-1","referenceIndex":22,"text":"22","element":"a"},{"text":"] and deep generative models [","element":"span"},{"href":"#id-2","referenceIndex":20,"text":"20","element":"a"},{"text":", ","element":"span"},{"href":"#id-3","referenceIndex":15,"text":"15","element":"a"},{"text":", ","element":"span"},{"href":"#id-4","referenceIndex":24,"text":"24","element":"a"},{"text":"]. This paper proposes a system for amortized inference in PPLs. In our system, amortization comes in the form of a parameterized ","element":"span"},{"style":{"fontStyle":"italic"},"text":"guide program","element":"span"},{"text":". Guide programs have similar structure to the original program, but can have richer data flow, including neural network components. These networks can be optimized so that the guide approximately samples from the posterior distribution defined by the original program. We present a flexible interface for defining guide programs and a stochastic gradient-based scheme for optimizing guide parameters, as well as some preliminary results on automatically deriving guide programs. We explore in detail the common machine learning pattern in which a ‘local’ model is specified by ‘global’ random values and used to generate independent observed data points; this gives rise to amortized local inference supporting global model learning.","element":"span"}]]},{"heading":"1 Introduction","paragraphs":[[{"text":"Probabilistic models provide a framework for describing abstract prior knowledge and using it to reason under uncertainty. Probabilistic programs are a powerful tool for probabilistic modeling. A probabilistic programming language (PPL) is a deterministic programming language augmented with random sampling and Bayesian conditioning operators. Performing inference on these programs then involves reasoning about the space of executions which satisfy some constraints, such as observed values. A universal PPL, one built on a Turing-complete language, can represent any computable probability distribution, including open-world models, Bayesian non-parameterics, and stochastic recursion ","element":"span"},{"href":"#id-5","referenceIndex":6,"text":"[6, ","element":"a"},{"href":"#id-6","referenceIndex":19,"text":"19, ","element":"a"},{"href":"#id-7","referenceIndex":33,"text":"33]","element":"a"},{"text":".","element":"span"}],[{"text":"If we consider a probabilistic program to define a distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":")","element":"span"},{"text":", where ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"are (latent) intermediate variable and ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y ","element":"span"},{"text":"are (observed) output data, then sampling from this distribution is easy: just run the program forward. However, computing the posterior distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":") ","element":"span"},{"text":"is hard, involving an intractable integral. Typically, PPLs provide means to approximate the posterior using Monte Carlo methods (e.g. MCMC, SMC), dynamic programming, or analytic computation.","element":"span"}],[{"text":"These inference methods are expensive because they (approximately) solve an intractable integral from scratch on every separate invocation. But many inference problems have shared structure: it is reasonable to expect that computing ","element":"span"},{"style":{"height":16},"width":128.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/0-0.png","element":"img","alt":" p(x|y1)","inline":true,"padRight":true},{"text":"should give us some information about how to compute ","element":"span"},{"style":{"height":16},"width":128.86,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/0-1.png","element":"img","alt":" p(x|y2)","inline":true},{"text":". In fact, there is reason to believe that this is how people are able to perform certain inferences, such as visual perception, so quickly—we have perceived the world many times before, and can leverage that accumulated knowledge when presented with a new perception task [","element":"span"},{"href":"#id-8","referenceIndex":3,"text":"3","element":"a"},{"text":"]. This idea of using the results of previous inferences, or precomputation in general, to make later inferences more efficient is called ","element":"span"},{"style":{"fontStyle":"italic"},"text":"amortized inference ","element":"span"},{"href":"#id-8","referenceIndex":3,"text":"[3, ","element":"a"},{"href":"#id-0","referenceIndex":28,"text":"28]","element":"a"},{"text":".","element":"span"}],[{"style":{"fontStyle":"italic"},"text":"Learning ","element":"span"},{"text":"a generative model from many data points is a particularly important task that leads to many related inferences. One wishes to update global beliefs about the true generative model from individual data points (or batches of data points). While many algorithms are possible for this task, they all require some form of ‘parsing’ for each data point: doing posterior inference in the current generative model to guess values of local latent variable given each observation. Because this local parsing inference is needed many many times, it is a good candidate for amortization. It is plausible that learning to do local inference via amortization would support faster and better global learning, which gives more useful local inferences, leading to a virtuous cycle.","element":"span"}],[{"text":"This paper proposes a system for amortized inference in PPLs, and applies it to model learning. Instead of computing ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":") ","element":"span"},{"text":"from scratch for each ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":", our system instead constructs a program ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":") ","element":"span"},{"text":"which takes ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y ","element":"span"},{"text":"as input and, when run forward, produces samples distributed approximately according to the true posterior ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":"|","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":")","element":"span"},{"text":". We call ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"a ","element":"span"},{"style":{"fontStyle":"italic"},"text":"guide program","element":"span"},{"text":", following terminology introduced in previous work [","element":"span"},{"href":"#id-9","referenceIndex":11,"text":"11","element":"a"},{"text":"]. The system can spend time up-front constructing a good approximation ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"so that at inference time, sampling from ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"is both fast and accurate.","element":"span"}],[{"text":"There is a huge space of possible programs ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"one might consider for this task. Rather than posing the search for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"as a general program induction problem (as was done in previous work [","element":"span"},{"href":"#id-9","referenceIndex":11,"text":"11","element":"a"},{"text":"]), we restrict ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"to have the same control flow as the original program ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":", but a different data flow. That is, ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"samples the same random choices as ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p ","element":"span"},{"text":"and in the same order, but the data flowing into those choices comes from a different computation. In our system, we represent this computation using neural networks. This design choice reduces the search for ","element":"span"},{"style":{"fontStyle":"italic"},"text":"q ","element":"span"},{"text":"to the much simpler continuous problem of optimizing the weights for these networks, which can be done using stochastic gradient descent.","element":"span"}],[{"text":"Our system’s interface for specifying guide programs is flexible enough to subsume several popular recent approaches to variational inference, including those that perform both inference and model learning. To facilitate this common pattern we introduce the ","element":"span"},{"style":{"height":10.4},"width":111.64,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/1-0.png","element":"img","alt":" mapData","inline":true,"padRight":true},{"text":"construct which represents the boundary between global “model” variables and variables local to the data points. Our system leverages the independence between data points implied by ","element":"span"},{"style":{"height":10.4},"width":111.64,"height":26,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/1-1.png","element":"img","alt":" mapData","inline":true,"padRight":true},{"text":"to enable mini-batches of data and variance reduction of gradient estimates. We evaluate our proof-of-concept system on a variety of Bayesian networks, topic models, and deep generative models.","element":"span"}],[{"text":"Our system has been implemented as an extension to the WebPPL probabilistic programming language [","element":"span"},{"href":"#id-10","referenceIndex":7,"text":"7","element":"a"},{"text":"]. Its source code can be found in the WebPPL repository, with additional helper code at ","element":"span"},{"href":"https://github.com/probmods/webppl-daipp","style":{"fontFamily":"monospace"},"text":"https://github.com/probmods/webppl-daipp","element":"a"},{"text":".","element":"span"}]]},{"heading":"2 Background","paragraphs":[[{"id":"id-14","style":{"fontWeight":"bold"},"text":"2.1 ","element":"span"},{"style":{"fontWeight":"bold"},"text":"Probabilistic Programming Basics","element":"span"}],[{"text":"For our purposes, a probabilistic program defines a generative model ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"style":{"fontStyle":"italic"},"text":", ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":") ","element":"span"},{"text":"of latent variables ","element":"span"},{"style":{"fontWeight":"bold"},"text":"x ","element":"span"},{"text":"and data ","element":"span"},{"style":{"fontWeight":"bold"},"text":"y","element":"span"},{"text":". The model factors as:","element":"span"}],[{"id":"id-24","style":{"width":"65%"},"width":1045,"height":86,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/1-2.png","element":"img"}],[{"text":"The prior probability distribution ","element":"span"},{"style":{"fontStyle":"italic"},"text":"p","element":"span"},{"text":"(","element":"span"},{"style":{"fontWeight":"bold"},"text":"x","element":"span"},{"text":") ","element":"span"},{"text":"decomposes as a product of conditionals ","element":"span"},{"style":{"height":16},"width":308.42,"height":40,"src":"https://cdn.bytez.com/mobilePapers/v2/arxiv/1610.05735/images/1-3.png","element":"img","alt":" pi(xi|x