Beyond Application End-Point Results: Quantifying Statistical Robustness of MCMC Accelerators

2020·Arxiv

Abstract

Abstract

Statistical machine learning often uses probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated with specialized hardware by exploiting parallelism and optimizing the design using various approximation techniques. Current methodologies for evaluating correctness of probabilistic accelerators are often incomplete, mostly focusing only on end-point result quality (“accuracy”). It is important for hardware designers and domain experts to look beyond end-point “accuracy” and be aware of the hardware optimizations impact on other statistical properties.

This work takes a first step towards defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators beyond end-point result quality. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply our framework to a representative MCMC accelerator and surface design issues that cannot be exposed using only application end-point result quality. Applying the framework to guide design space exploration shows that statistical robustness comparable to floating-point software can be achieved by slightly increasing the bit representation, without floating-point hardware requirements.

I. INTRODUCTION

Statistical machine learning, like other methods in artificial intelligence, has become an important workload for computing systems. Such workloads often utilize probabilistic algorithms, such as Markov Chain Monte Carlo (MCMC) methods, which enable the potential to provide generalized frameworks to solve a wide range of problems. As alternatives to Deep Neural Networks, these algorithms provide easier access to interpreting why a given result is obtained through their model transparency and statistical properties. Many specialized accelerators are proposed to address the sampling inefficiency of probabilistic algorithms [4], [7], [31], [36]–[38], [44], [53], by utilizing approximation techniques to improve the hardware efficiency, such as reducing bit representation, truncating small values to zero, or simplifying the random number generator.

Understanding the influence of these approximations on the application results is crucial to meet the quality requirement in real applications. A hardware accelerator should provide correct execution of target algorithms. A common approach to evaluating correctness is to compare the end-point result quality (“accuracy”) against accurately-measured or handlabeled ground-truth data using community-standard benchmarks and metrics: the hardware execution is considered to be correct if it provides comparable “accuracy” to the softwareonly implementations that do not have these approximations. However, in the domain of probabilistic computing/algorithms, correctness is defined by more than the end-point result of executing the algorithm, and includes additional statistical properties that convey uncertainty and interpretability about the end-point result. End-point “accuracy” is necessary but not sufficient to claim correctness. Current methodologies for evaluating probabilistic accelerators are often incomplete or adhoc in evaluating correctness, focusing only on end-point “accuracy” or limited statistical properties. Failure to adequately account for domain-defined correctness can have adverse or catastrophic outcomes, such as a surgeon failing to completely remove a tumor due to incorrect uncertainty in a segmented image [9], [43]. Furthermore, end-point “accuracy” may not always be possible since ground-truth data is not always available. It is important for hardware designers and domain experts to look beyond end-point “accuracy” and be aware of the impact of optimizations on other statistical properties. Therefore, a probabilistic architecture should provide some measure (or guarantee) of statistical robustness.

This work takes a first step towards defining metrics and a methodology for quantitatively evaluating correctness of probabilistic accelerators beyond end-point result quality. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit (Contribution 1). Each pillar has at least one quantitative empirical metric, does not require ground-truth data, and collectively these pillars enable comparison of specialized hardware to 64-bit floating-point (FP64) software. We expose several challenges with naively applying existing popular metrics for our purposes, including: high dimensionality of the target applications, and random variables with zero variance. Therefore, we modify the existing methodologies for sampling quality and convergence diagnostic, and propose a new metric for convergence diagnostic (Contribution 2). We call on domain experts to develop metrics with stronger theoretical foundations to account for common hardware optimizations. Below is a summary of each pillar.

Pillar I) Sampling Quality. The intrinsic nature of MCMC methods creates dependency between samples. A sufficient number of independent samples are needed to converge and produce high-quality results. We use Effective Sample Size (ESS) to measure the number of independent samples drawn from an MCMC run, and report the arithmetic mean as a scalar metric. Low ESS indicates that more iterations may be required to generate sufficient independent samples.

Pillar II) Convergence Diagnostic. The total running time of an MCMC run is determined by when it converges. Convergence can be measured by Gelman-Rubin’s [6], but this metric is undefined for variables with zero variance. Therefore, we propose a process to determine convergence that accounts for zero variance and a new metric—convergence percentage—based on , to measure the total percentage of converged results. Low convergence percentage indicates that more iterations are required for the model to converge.

Pillar III) Goodness of Fit. In the absence of ground-truth data (labeled data), it is important to understand the differences between the baseline FP64 and hardware end-point results to evaluate the overall quality of the hardware. We provide two “goodness of fit” approaches: 1) Root Mean Squared Error (RMSE) on application specific data relative to a software reference, and 2) Jensen-Shannon Divergence (JSD) to evaluate all possible data inputs and provide worst-case distribution divergence.

As a case study, we demonstrate the framework in a representative probabilistic accelerator [57] and show that 1) end-point accuracy alone is insufficient, particularly for predicting outcome for previously unseen inputs, and 2) that FP64 is insufficient as ground truth since in some cases more limited precision can produce more accurate end-point results based on labeled data (Contribution 3). The analysis shows that the accelerator achieves the same application end-point result quality as FP64 software, confirming the previous work. However, we show the accelerator differs with FP64 in ESS and convergence percentage, and reveal that applications need to run more iterations on the accelerator to achieve the same statistical robustness as FP64, reducing the accelerator’s effective speedup. We also explore the design space of the above accelerator to expose the trade-offs between statistical robustness and area/power (Contribution 4). We find that considerable improvement in statistical robustness, comparable to the FP64 software, can be achieved by slightly increasing the bit precision from 4 to 6 and removing an approximation technique, with only 1.20area and 1.10power overhead.

The remainder of this paper is organized as follows. Sec. II provide the necessary background and motivation for this work. The detailed description of three pillars is in Sec. III. Sec. IV describes the analysis of statistical robustness on a representative accelerator and we perform design space exploration using the three pillars in Sec. V. Sec. VI discusses limitations and future work, related work is presented in Sec. VII, and Sec. VIII concludes the paper.

II. BACKGROUND AND MOTIVATION

A. Probabilistic Algorithms

Probabilistic (stochastic) algorithms are a powerful approach used in many application (e.g., image analysis [16], robotics, natural language processing [12], global health [21], and wireless communications [22]). Probabilistic algorithms

Fig. 1. Stereo vision using Markov Chain Monte Carlo (Markov Random Field Gibbs Sampling). Note that Sampling is performed in the inner loop.

solve problems by randomly inferring the parameters in the probabilistic models to explain observed data, which create opportunities to provide generalized frameworks and are the only practical approach to solve certain problems, such as high-dimensional inference. Compared with deep neural networks, probabilistic models are “conceptually simple, compositional, and interpretable” [17]. Bayesian Inference is an important framework in probabilistic models, which updates the probability estimate for a hypothesis by combining information from prior beliefs and observed data. Suppose X is the latent random variable of interest. The goal is to retrieve the posterior distribution p(X|D) given the prior beliefs of X and the observed data D using Bayes’ theorem: p(D|X)p(X), where p(X) is the prior distribution of X, and p(D|X) is the likelihood of observing D given a certain value of X. Both X and D can be scalars or multidimensional vectors. As the dimension of D and X increases, analytically or numerically deriving the exact result of a posterior distribution becomes complicated and intractable. One approach to break “the curse of dimensionality” is Markov Chain Monte Carlo (MCMC), methods that solve the problems by iteratively generating samples on random variables and eventually converge to a result regardless of the initial stage.

Fig. 1 shows an example of using an MCMC method, Markov Random Field (MRF) Gibbs Sampling, in stereo vision, demonstrated by previous work [5]. Stereo vision reconstructs the depth information of objects in an image pair by matching the corresponding pixels that represent the same objects. The results are presented in a disparity map, indicating the depth of the corresponding objects in the image (lighter is closer). As shown in Fig. 1, the MCMC method iterates the image pixels by considering the disparity of each pixel as a random variable. For each pixel, it evaluates probabilities of each possible label (disparity outcome) and draws a sample as the output label. Each probability is determined by the neighboring pixels label values and the initial pixel data values of the image pair, defined by First-Order MRF graphical model [16]. The outer loop (a.k.a. iteration) iterates on the whole image until convergence.

MCMC methods rely on efficiently generating samples from a parameterized distribution, which involves step 1) efficiently computing the parameters of the distribution to sample from based on observed data, and step 2) efficiently generating samples from the parameterized distribution. Unfortunately, as described in previous work [53], sampling overhead can be notably high: step 2) alone takes hundreds of CPU cycles for a simple distribution. One approach to reduce the sampling overhead is to use approximation techniques in algorithms [40], [45], [55]. Deterministic methods, such as Expectation Propagation and Variational Bayesian, are alternatives to MCMC. Although these methods are often more efficient in the applied cases, domain experts prefer conceptually straightforward, mathematically simple, yet accurate methods.

B. Specialized Probabilistic Accelerators

Meeting the needs of domain experts may be achieved by accelerating sampling through hardware specialization. Previous efforts seek to efficiently generate specific types of distributions using FPGAs [1], [49], specialized circuits [8], specialized architectures are proposed to accelerate specific algorithms and models, such as a Stochastic Transition Circuit [38], dedicated MCMC models [32], [36], [44], Bayesian Neural Network [7], Stochastic Gradient Descent [37], an ASIC accelerator for Bayesian Networks [30], CMOS-hybrid MRF Gibbs Sampling Unit [53], [57], and workflows to compile probabilistic programming language to hardware accelerators in Chisel code [4]. Many of these accelerators use various forms of approximation (e.g., limited bit precision, pseudo random number generators, etc.) to reduce area/power, allowing more individual units on a single chip and thus improving overall performance.

Understanding the influence of these hardware optimizations on correctness is a critical aspect of any design process. An intuitive approach is to evaluate the end-point (final) result quality (“accuracy”) for applications using community-standard benchmarks and metrics. However, end-point “accuracy” is necessary but not sufficient to claim correctness: 1) given the uncertainty of input data, the observed end-point result quality has no indication of “accuracy” for unseen data, and thus just making statements on the observed “accuracy” is not enough, 2) many applications look beyond the end-point “accuracy” and consider uncertainty quantification, and 3) measuring “accuracy” may not always be possible as ground-truth data is not always accessible and thus no comparison to an end-point result is available. Probabilistic algorithm domain experts frequently use statistical properties to evaluate the confidence on the unseen data and to help uncertainty quantification (e.g., tumor resection [9], [43]). A probabilistic architecture should provide some measure (or guarantee) of statistical robustness.

This paper represents a first step toward articulating a set of metrics and methodology for quantifying the statistical robustness of probabilistic accelerators. Sec. III presents our proposed metrics and we use an accelerator from Zhang, et al. [57] (described below) as a case study to demonstrate how

Fig. 2. The SPU pipeline derived from RSU-G [57]. The sampling stage is replaced with a CMOS discrete sampler.

to analyze an existing accelerator and to perform design space exploration.

C. A Representative Probabilistic Accelerator

As a case study, we consider the Gibbs Sampling accelerator design described by Zhang, et al. [57] implemented entirely in CMOS using pseudo random number generation instead of molecular optical devices (cf. [57] Sec. IV.C). Fig. 2 shows the baseline pipeline design, which we call a Stochastic Processing Unit (SPU). It is divided into four main stages with two internal decoupling FIFOs and an inverse transform method is used for the discrete sampler.

Given the data and neighbor labels, the first stage computes the total energy of a possible label E(i) (Eq. 1) each cycle, where and are application parameters. The energy E(i) is then dynamically scaled using subtraction to maximize the dynamic range (Eq. 2). Both E(i) and are 8-bit unsigned integers. In the third stage, the scaled energy is converted to a scaled probability represented in 4-bit unsigned integer. The original probability is computed by which is represented as a real number between [0,1] using floating-point in software, where T is a fixed parameter per outer iteration. However, the probability is scaled using Eq. 3 and truncated using Eq. 4 to match the unsigned integer representation, where is the number of bits in the scaled probability output . Additionally, probability truncation drives all scaled probabilities that are less than one to zero and approximation rounds all scaled probabilities down to the nearest integer value (Eq. 4). The value of can be pre-computed and stored in a look-up table (LUT). The values in the LUT need updates if T changes between iterations. The final stage of SPU generates a discrete sample per variable based on the probabilities of all possible label values using the least 12-bits of a 19-bit LFSR to implement the inverse transform sampling.

TABLE I SPU RESULT QUALITY FROM ONE RUN PER DATASET.

The SPU supports two operating modes: 1) pure-sampling and 2) optimization (simulated annealing). Pure-sampling iteratively generates Gibbs samples using constant temperature T, where T is considered a parameter obtained during model training. When converged, the estimated distribution of a random variable (e.g., distribution of possible disparities in a pixel) can be obtained by collecting the latest N samples. An exact result can be obtained from the mode of the estimated distribution, the most frequent label. The optimization mode uses simulated annealing to converge to an exact result faster by strategically decreasing the temperature T [16]. T is initially high so that every label has similar probability of being selected. As T decreases, labels with the lowest energy are likely to be selected, eventually leading to convergence. The optimization mode requires fewer iterations than pure-sampling, but cannot provide an estimated distribution. The previous work [57] evaluates only the optimization mode.

We implement the SPU in Verilog and use QuestaSim simulation to evaluate the end-point result quality of the same three applications assessed in the previous work [57]: image segmentation, motion estimation, and stereo vision. We use community-standard metrics (e.g., variation of information in image segmentation [39], [56], end-point error in motion estimation [2], and bad-pixel percentage in stereo vision [48]). Tab. I shows the result quality comparison between FP64 software and the SPU. Each result is collected by a single run per dataset in simulated annealing mode. We validate that the SPU with a simple 19-bit LFSR as its random number generator (RNG) achieves the same result quality as the software. Image segmentation results indicate the same conclusion and are omitted for brevity. We also obtain similar high quality applicatoin results on an Intel Arria 10 FPGA prototype. Despite these results we are left with the question: What do the results in Tab. I indicate about accelerator statistical robustness? The short answer is nothing. The following sections present our initial efforts toward providing a better answer.

III. THREE PILLARS OF STATISTICAL ROBUSTNESS

To identify appropriate measures of hardware statistical robustness, we draw on known techniques utilized by domain experts to evaluate their models and algorithms. Ideally, we could formally prove bounds on relevant metrics [14]. Unfortunately, some hardware optimizations (e.g., truncation to zero) make formal proofs extremely difficult or impossible. A provable architecture introduces more complicated hardware. Therefore, we rely on existing empirical diagnostic tests for MCMC techniques, that are based on foundations in statistics, to establish three pillars for assessing a probabilistic accelerator’s statistical robustness: 1) sampling quality [50], 2) convergence [11], and 3) goodness of fit. Each pillar has at least one quantitative measure, and provides insight to application users and to hardware designers. Collectively these pillars help in understanding/explaining end-point results, and can indicate the performance of the MCMC execution, such as how many iterations are required to converge. Note that the statistical robustness is jointly affected by the algorithm and hardware architecture. Therefore, we compare hardware results with FP64 software as the baseline to extract the impact of hardware optimizations. The remainder of this section presents our proposed three pillars for evaluating statistical robustness of an MCMC accelerator.

A. Pillar 1: Sampling Quality

A sampling algorithm with perfect sampling quality generates independent samples. However, an MCMC sample is drawn based on the current values of random variables— the outcome of samples in the previous iteration. Recall in stereo vision, the disparity of a pixel depends on the current states of its neighborhood. This dependency creates correlations between samples which is nontrival until several subsequent samples are drawn, which can be represented as an autocorrelation time . This implies that by generating n samples from MCMC, only samples can be considered independent. A sufficient number of independent samples are required to derive meaningful statistical measures (e.g., mean and variance). Note that the sample dependency is an intrinsic property of MCMC algorithms and exists even with a perfect random number generator and FP64 precision.

Effective Sample Size (ESS) is commonly used as a sampling quality metric that represents how many independent samples are drawn among all the dependent samples. In general, higher ESS indicates the MCMC sampler is more efficient at generating independent samples. Unfortunately, there is not consensus on a single ESS definition [18]. This paper uses the definition discussed by Kass et al. [29] based on autocorrelation. Since closed form expressions for ESS are difficult we estimate ESS using known methods [36], [50].

We estimate ESS on a univariate random variable using Eq. 5, where n is the number of dependent MCMC samples (iterations) and is the autocorrelation function of the sample sequence. We sum up the first K contiguous lags where . Theoretically, ESS = n provides the best sampling quality where all samples are independent; however, Eq. 5 is an estimate of ESS, thus it is numerically possible that ESS > n. Furthermore, ESS has no definition when all collected samples have the same value (zero variance), which is possible in practice as shown in Sec. IV.

Many MCMC problems are high-dimensional (many random variables). For example, in stereo vision a 320320 input image has 102,400 dimensions. An ideal metric can report a scalar ESS value to account for all dimensions as a single random variable. While methods exist to report multivariate ESS [51], to our knowledge they aren’t practical in our case and they do not allow zero variance for any variable. In this paper, we calculate ESS per dimension (random variable/pixel) and report the arithmetic mean ESS. Sec. IV describes how we handle zero variance cases, and we encourage domain experts to develop an evaluation method with stronger theoretical foundations to account for these zero variance cases. Pillar Insight. If ESS is low it may take more MCMC iterations to achieve an acceptable ESS. If a hardware accelerator produces substantially lower ESS than software, the additional iterations may reduce its effective speedup.

B. Pillar 2: Convergence Diagnostic

An important question for an MCMC method is when to stop iterating, determined by when the MCMC is converged. Similar to effective sample size (ESS), the time to convergence is used to analyze algorithms and input data when using software even with FP64 and good random number generators. Multiple methods exist to measure the convergence. A comprehensive review is provided by Cowles et al. [11]. We use Gelman-Rubin’s [6], a popular quantitative method provided by many statistical packages, to measure whether a univariate random variable (e.g., a pixel in stereo vision) is converged at a certain iteration.

Gelman-Rubin’s (potential scale reduction factor) estimates the convergence by comparing the between-chain variance (B) and within-chain variance (W) across multiple independent runs on the same MCMC instance1. Equations 6 and 7 show the computation to obtain an from m independent MCMC runs, each with n samples, where is an estimate on the variance of random variable. As a rule of thumb [6], [15], a univariate random variable is considered converged when . Typically larger indicates that more iterations are needed to converge. Note that the method requires a random value initialized from an overdispersed distribution. We meet this requirement by initializing random variables (i.e., initial labels of pixels) uniform-randomly.

A scalar convergence diagnostic is preferred for multidimensional applications. Similar to ESS, handling high dimensions and the random variables with estimated zero variance (W = 0) is challenging using existing methods [6], [52]. The original Gelman-Rubin’s metric has no definition at W = 0. We consider a random variable converged when B = 0 and W = 0, which indicates all samples are the same value from different iterations and MCMC runs. A

Fig. 3. Process to determine convergence of a univariate random variable.

random variable is not considered converged when B > 0 and W = 0, which indicates samples are the same value within MCMC runs, but different across MCMC runs. Fig. 3 shows our process to determine whether a univariate random variable is converged. We propose convergence percentage, the percentage of converged univariate random variables, as our new metric.

Pillar Insight. Low convergence percentage indicates that more iterations are needed for the model to converge. If a hardware accelerator takes substantially more iterations to converge than software, the additional iterations may reduce its effective speedup.

C. Pillar 3: Goodness of Fit

Finally, understanding the “goodness of fit”—the difference between end-point results produced by the software and by the hardware accelerator—is critical to evaluate the overall quality of the hardware accelerator. A straightforward approach is to compare the end-point result quality using community-standard benchmarks and metrics. However, ground truth data are not always available. We provide two “goodness of fit” approaches: 1) using application specific data to measure how good the hardware results fit to a reference software result, and 2) using a distribution divergence measurement to evaluate all possible data inputs and provide worst-case divergence.

1) Application Data Analysis: We are interested in how

close/different the results are between the software and hardware. For example, the difference of disparities across the whole image given the same image inputs. Popular quantitative metrics for “goodness of fit” include Root Mean Squared Error (RMSE) and coefficient of determination (). RMSE measures the average squared difference between the result from a hardware MCMC run and a reference software run. The range of RMSE is from 0 to infinity, where lower is better. Zero RMSE means the hardware produces identical results to the software (i.e., perfectly fit). measures the portion of variance in the hardware results that can be explained by the software results. Typically, has a more intuitive range of [0,1] than RMSE. A higher value is preferred and means hardware and software have perfect fit. However, variance of the software result is in the denominator of a negative term in formula. That means results can have a misleadingly bad value even with a good RMSE if variance of the software results is small. Since our target is to measure the result differences caused by hardware approximations, we choose using RMSE as our metric. A detailed introduction to these statistics can be found elsewhere [27].

Due to the stochastic nature of MCMC methods, each MCMC run can have different end-point results for either software or hardware. To account for this variation, we compute RMSE for both hardware and software with respect to a reference software result from multiple MCMC runs. The reference software result is obtained using the mode of multiple software runs to minimize the result variation in a single software reference run.

2) Data-independent Analysis: Recall from Sec. II-A that

step-one of sampling is computing the probability distribution to sample from. Hardware approximations in this step, such as limited precision and truncation, introduce divergence from the distribution obtained from FP64 software. Quantifying the distribution divergence of hardware from software provides insights on why the results are good (or bad) and gives the worst-case divergence in arbitrary data inputs. This can be measured using a popular divergence measure Kullback-Leibler (KL) divergence (). However, value goes to infinity when any entry of is zero while is non-zero, which is likely to happen under the hardware technique of truncating small probabilities to zero, and thus cannot be directly applied in our study. We use Jensen-Shannon Divergence (JSD) [33] instead as the divergence measurement. JSD, shown in Eq. 8, is defined based on KL-devergence, where . Note that unlike KL-divergence, is a symmetric method where .

The goal of using JSD is to evaluate the distribution divergence caused by hardware approximations on arbitrary data inputs, which gives some insights on how the hardware performs on unobserved data and provides worst-case scenarios. Unfortunately, evaluating JSD on arbitrary data inputs for a random variable with many possible labels, such as in stereo vision, is challenging in both analytical and empirical approaches given the complicated mathematical representation and the large parameter space. In this work, we start from evaluating the JSD between software and hardware results with arbitrary data inputs when a random variable has two possible labels, such as in a foreground-background image segmentation. Analytical studies, as well as analysis on manylabel cases, is our future work.

Pillar Insight. Substantially worse RMSE or JSD results for a hardware accelerator means it is likely producing low quality application end-point results and more iterations or model/hardware design changes may be required.

IV. ANALYZING EXISTING HARDWARE

We apply the three pillars of statistical robustness on an existing hardware, the Stochastic Processing Unit (SPU) described in Sec. II-C.

A. Methodology

In this work we consider a single SPU, as it is sufficient to explore the statistical robustness questions. Development of an accelerator prototype with multiple SPUs is ongoing but is beyond the scope of this work. We primarily utilize MATLAB for both the FP64 software and for a functionally equivalent SPU simulator. Importantly, we also have SPU implementations in Verilog, Chisel, and HLS all with verified results. The single-run result quality in Tab. I are from the Verilog implementation.

We choose stereo vision and motion estimation as our test applications. Motion estimation infers the motion vector of image pixels in a frame of a video with respect to its next sequential frame. The concept of applying MRF Gibbs Sampling on motion estimation is similar to stereo vision as described in Sec. II-A, except the output label is a 2D motion vector of a pixel, indicating where the pixel moves to in the next frame. Each disparity per pixel in stereo vision is treated as a random variable. Each 2D motion vector per pixel in motion estimation is considered as two random variables x and y. We pick three datasets from Middlebury [2], [48] for each application, as in the previous work [57]. We use FP64 runs to find the application parameters (e.g., and ). Motion estimation has one set of parameters for all datasets, and stereo vision has two sets for all datasets. Some parameters can be optimized in a training process, which is beyond the scope of this paper. We also considered, but omit, image segmentation since it converges too fast (30 iterations for simulated annealing) to produce meaningful statistical measurements.

Recall the SPU supports two operating modes: pure sampling that produces the full estimated distribution (sampling), and optimization using simulated-annealing (optimization) that converges quickly to an exact result. For optimization, measuring Effective Sample Size (ESS) and Gelman Rubin’s is not conceptually meaningful, we evaluate sampling quality and convergence diagnostic for sampling only and goodness of fit for both modes. Parameter settings for each dataset are the same in sampling as in optimization, except for a different, fixed temperature. Our empirical results show that all datasets converge after 1,000 iterations for optimization and 3,000 for sampling, except for poster in stereo vision which takes only 500 and 1,500 iterations, respectively.

B. Results Analysis

1) Sampling Quality: We analyze ESS on SPU compared with the FP64 software by collecting the last 1,000 iterations of MCMC runs in the two applications. We evaluate the ESS per random variable and report the arithmetic mean. Fig. 4 shows an example ESS per random variable in stereo vision teddy dataset. Red regions indicate the random variables have zero variance, and thus ESS cannot be calculated. Due to truncating small probabilities to zero, more random variables in the SPU have zero variance than in the software. We consider a random variable with zero variance inactive. The percentage of inactive random variables with respect to the total (a.k.a. inactive percentage) in three stereo vision datasets are 26.9% for art, 44.6% for poster, and 26.2% for teddy in the SPU, compared with 0.3% for art, 4.1% for poster, and 1.4% for teddy in the software. Motion estimation exhibits similar

Fig. 4. ESS per random variable in stereo vision teddy. Red regions correspond to zero variance. Red regions in the SPU overlap high ESS regions with small variances in the software.

Fig. 5. Mean overall and active ESS (higher is better) in each dataset.

inactive percentages. Zero variance means the probability of a possible label is large enough that all random samples pick the same label, which can indicate convergence. The variance of corresponding inactive random variables in software are consistently small, indicating the random variable are likely to consistently pick the same label as well—a concentrated distribution. Therefore, high inactive percentage does not necessarily imply bad result quality.

Fig. 5 shows the ESS arithmetic mean for a single MCMC run per dataset. We verify that different runs have small ESS difference (< 6 in stereo vision). The mean “overall” ESS eliminates the random variables with zero variance in software and hardware, respectively. Fig. 4 reveals that the inactive regions in the SPU (red) correspond to the regions with high ESS in software due to small but non-zero variance (yellow), and thus overall ESS is biased toward software. Therefore, we also report the mean “active” ESS which only includes the regions with non-zero variance in both software and the SPU, where ESS is more meaningful. As a consequence, the active ESS eliminates the regions with small variance in the software, which can potentially benefit the SPU. The importance of these small variances needs to be evaluated and we are actively looking for methods to account for these regions. The software has 1.1-1.4higher active ESS than the SPU in stereo vision and around 1.2in motion estimation. This implies the SPU needs to run 1.1-1.4iterations to reach the same active ESS as the software.

Fig. 6. Convergence percentage (higher is better) results. The software and the SPU results in motion estimation are for the same number of iterations.

2) Convergence Diagnostic: We evaluate the convergence diagnostic of SPU using the proposed convergence percentage metric. Each convergence percentage value is collected from 10 runs per dataset. Each run forfeits the first half of iterations as the burn-in period and only keeps the second half, as proposed by Gelman et al. [15]. Recall a random variable is considered converged if or both within-chain variance W and between-chain variance B are zero. The 2D motion vector is considered as two random variables in motion estimation. Fig. 6 shows the results. The number of iterations is normalized with respect to SPU runs in stereo vision and are the same in motion estimation. Overall, convergence percentage is high in both the software and the SPU: more than 80% of random variables in stereo vision and more than 90% in motion estimation. More than 99.5% of random variables with W = 0 in SPU are converged. In stereo vision, the SPU reaches the same or better convergence percentage than software with 2iterations. This indicates the SPU needs to be at least 2faster in order to have a better overall performance in this application in terms of convergence percentage. Previous work [53], [57] shows that a pipeline with the same data interface provides the speedups of at least 2.8-5.5and up to 84. The SPU has higher convergence percentages than software in motion estimation, indicating the SPU converges faster in this application. Note that converging to a distribution faster does not necessarily lead to a better end-point result. The goodness of fit should be evaluated.

3) Goodness of Fit: The goodness of fit provides a similarity/difference measure on end-point results produced by the MCMC accelerator compared with results produced by the software. Fig. 7 shows the RMSE box plots of 10 MCMC runs per dataset compared with a reference result obtained by the mode of 10 software runs per dataset. Solid boxes show the range from 25th to 75th percentile. The horizontal lines inside each box are the medians of the data. The whiskers include the range of data that are not considered as outliers. We use 1.5interquartile range as the rule to decide outliers, showed as pluses. Whiskers of the software and the SPU overlap in all stereo vision benchmarks, suggesting close results. RMSE results in motion estimation are visually different in Fig. 7b. However, these differences are small considering the small

Fig. 7. Root Mean Squared Error (lower is better). Note that scales are different in (a) and (b) due to application differences.

Fig. 8. Application end-point result quality (lower is better).

scale of y-axis. The software and the SPU produce closer results in simulated annealing optimization mode.

Fig. 8 shows the end-point result quality using ground truth data and application metrics. Most whiskers of the software and the SPU overlap except for art in stereo vision and rubberwhale in motion estimation, both of which are in sampling mode. In optimization mode, software and SPU whiskers overlap, indicating the difference in end-point result quality is very small. This is consistent with the single-run results in Tab. I. Note that no obvious differences between software and the SPU are visually observable in the stereo vision disparity maps and motion estimation flow maps.

It seems intuitive to assume that FP64 software should produce no worse results than hardware with limited precision, truncation, and a simplified RNG. We find this assumption holds in most, but not all, cases. We observe that for sampling , dimetrodon has consistently lower end-point result error in the SPU than in the software (see Fig. 8b), which is consistent with convergence percentage results, shown in Fig. 6b. However, Fig. 7b shows that the SPU RMSEs are higher than FP64 software. To better understand this result, we examine perpixel end-point error differences between the software reference and the SPU, as shown in Fig. 9. Blue regions correspond to lower end-point error in the SPU and yellow to lower end-point error in software. The figure suggests the software and

Fig. 9. Dimetrodon end-point error difference () at pixel level. End-point error is 0.581 in software vs. 0.567 in SPU.

Fig. 10. Jensen-Shannon Divergence comparison between the SPU (left) and a previously proposed accelerator 1st-gen RSU-G (right).

the SPU have strengths in different regions, which potentially leads to a high RMSE compared to the software reference. This result indicates two insights: 1) software with higher precision does not necessarily produce better application end-point result quality, and 2) a higher RMSE compared to software does not always indicate worse application end-point result quality. Although bad pixel-percentage results are consistent with RMSE in stereo vision, the general link between the goodness of fit measure and the application end-point result quality needs to be further explored. This confirms collectively applying three pillars beyond end-point result is necessary to evaluate correctness.

We analyze the Jensen-Shannon Divergence of SPU relative to software with FP64 probability representation. Our goal is to provide insights about why hardware exhibits good or bad application end-point results and how it may perform with arbitrary input data. We assume each random variable has a binary distribution in this analysis. By sweeping a wide range of possible energy inputs E(i) from 0 to 255 in integer, corresponding to arbitrary data inputs, Fig. 10 plots JSD for two temperatures (1 and 10) and two different SPU microarchitectures: 1) the SPU described in Sec. II-C and 2) an early design described by Wang et al. [53]–called RSUG—that was shown to lack sufficient precision and dynamic range [57]. These results clearly show the problems with the 1st generation RSU-G. The more recent SPU JSDs are negligible in most energy inputs (blue regions), whereas the 1st-gen RSU-G has high JSD for many inputs (yellow). The interpretation of these absolute JSD values are up to domain experts. The 1st-gen RSU-G JSD is greater than 0.2 for most energy inputs and becomes worse when temperature decreases, which explains the bad application result quality. A key difference between these two designs is dynamic scaling for energy values in the SPU, which is not present in the 1st-gen RSU-G.

V. DESIGN SPACE EXPLORATION: A CASE STUDY

The previous section shows that architectural optimizations might have negative influence on the statistical robustness, even though producing comparable end-point results to FP64 software. The question is can we achieve desirable end-point result quality and statistical robustness without the commensurate overhead of FP64? To answer this question, we use the three pillars to explore the SPU design space.

A. Design Trade-offs

The SPU pipeline (Fig. 2) has several design parameters related to bit precision that potentially influence statistical robustness, including energy E(i) and , scaled and truncated probability , and RNG output bits. We fix energy E(i) and at 8 bits based on previous work [38], [57]. The number of bits in considerably influences the size of the energy-to-probability converter and the discrete sampler. We evaluate three design points with 4-bit, 6-bit, and 8-bit s. The influence of RNG output bits is small compared to and we find a 19-bit LFSR with 12-bit RNG outputs does not reduce the statistical robustness or result quality across all design points.

The SPU truncates all s to the nearest values, called approximation [57], enabling efficient energy-to-probability conversion by comparing the boundaries of energy values. Without approximation, a double-buffered 256-entry LUT is required to store the values to achieve a stall-free design. We evaluate the statistical robustness of each scaled probability design point with and without approximations. The above design parameters generally do not directly influ-ence the SPU per-iteration performance assuming the same interface at the same clock frequency. However, a design with lower precision may take more iterations to converge. On the other hand, higher precision requires more area and power affecting the number of SPU units in systems with limited area/power budget. Thus we discuss resource usage in Sec. V-C. Detailed system-level architecture investigations are beyond the scope of this paper.

B. Evaluating Design Parameters

Figs. 11-14 show our design space results. For brevity, we show only stereo vision results and highlight motion estimation results where needed. Starting from the current

Fig. 11. Stereo vision sampling quality in the design points.

Fig. 12. Stereo vision convergence percentage in the design points.

SPU design (“spu” label in figures), we analyze the statistical robustness by gradually increasing the precision. First, we replace the 19-bit LFSR sampler with a FP64 Mersenne Twister sampler [42] while keeping the front-end pipelines unchanged (“p4a”). Next, we increase the bit width of to 6-bit (“p6a”) and 8-bit (“p8a”), with approximation and then remove approximation corresponding to “p4”, “p6”, and “p8” in the figures. We further evaluate an upper-bound design (“pd”) that keeps front-end pipeline up to the scaled energy () output unchanged, but has a FP64 back-end for probability conversion and discrete sampling.

1) Sampling Quality: Fig. 11a shows the overall ESS, which omits random variables with zero variance for each design, respectively. Recall this metric can create bias that benefits software for variables with small but non-zero variance. Overall ESSs increase when more bits are added, partly as a result of fewer random variables with zero variance. Recall the SPU truncates small scaled probabilities to zero. Adding more bits keeps more possible labels with

Fig. 13. Stereo vision RMSE in the design points.

Fig. 14. Stereo vision application end-point result quality in the design points.

small probabilities available to be sampled. Fig. 11b indicates inactive percentage drops significantly when increasing bit size from 4 to 6. Interestingly, approximation helps reduce the inactive percentage under the same bit precision, but decreases ESS for 6-bit and 8-bit designs. Fig. 11c shows the active ESS for the teddy dataset. Recall active ESS masks out the random variables inactive in either software or the SPU. With approximation, increasing bit precision does not close the gap in active ESS with the software. Without

approximation, 6 or 8-bit have comparable overall and active ESS to software. As expected, increasing bit precision

TABLE II RESOURCE USAGE AND PERFORMANCE OF VARIOUS SPU IMPLEMENTATIONS ON ARRIA 10 FPGA.

decreases the difference between overall and active ESS due to fewer inactive variables. Results for motion estimation (omitted) are similar.

2) Convergence Diagnostic: Fig. 12 shows the convergence percentage increases with the increasing bit precision. In contrast to ESS, approximation improves the convergence percentage under the same bit precision. Hardware with 6-bit and 8-bit with and without approximation produces comparable convergence percentage to software. All designs except “p4” produce the same or higher convergence percentage for motion estimation (results not shown).

3) Goodness of Fit: Fig. 13 shows RMSE results compared with software reference results. Observable lower RMSEs can be found in stereo vision art when increasing the bit precision from 4 to 6. Differences of RMSEs are hard to notice when further increasing the precision given whiskers largely overlap in most datasets. Application end-point results in Fig. 14 exhibit the same trends. All designs produce comparable result quality to the software in simulated annealing (optimization), consistent with Tab. I. We highlight the following results for motion estimation (not shown): 1) the design parameters have negligible influence on application end-point result quality (end-point error) except “p4” in a couple of cases, which performs observably worse, 2) all designs except “p4” produce better end-point error than software for dimetrodon with sampling, 3) all designs produce slightly worse end-point error than the software for rubberwhale with sampling, 4) gaps exist between the software and all hardware designs including “pd” for RMSE, but not in end-point error, which confirms the importance of using all three pillars. Overall, optimization is more robust than sampling at producing good result quality across various designs. For both modes, increasing the scaled probability to 6 bits produces comparable goodness of fit results to the software.

C. Resource Usage

1) FPGA Resource Usage: We evaluate three different implementations of the SPU on an Intel Arria 10 FPGA [26]: 1) an optimized hand-written Verilog implementation with 4-bit scaled probability, 2) a High-Level Synthesis (HLS) implementation (HLS-int) that matches the hand-written Verilog (but using HLS basic integer data-types), and 3) an HLS implementation with a 32-bit floating-point back-end after energy computation (HLS-fp), eliminating the energy scaling stage. HLS-fp is developed in order to assess the

TABLE III AREA AND POWER ANALYSIS IN ASIC

option of directly using 32-bit floating-point representation inside the SPU for probability conversion and sampling. Tab. II shows the synthesis results. Despite higher resource requirement, HLS-int is close to the Verilog implementation in terms of performance. The resource usage can be further decreased by using reduced-precision integers. HLS-fp consumes significantly more resources compared to HLS-int and most importantly performs remarkably worse due to its lower operating frequency (369 MHz vs. 320MHz) and lower throughput (1 vs. 3 initiation interval cycles) caused by the floating-point addition [25]. Clearly, naively implementing the SPU in 32-bit floating-point consumes too much resources and significantly reduces the performance benefits. A humandesigned architecture is needed to improve efficiency.

2) ASIC: We estimate the ASIC area/power for various design points. Circuits elements written in Chisel are synthesized in a predictive 15nm library [41] using Synopsys Design Compiler. Memory elements (FIFOs and LUTs) are estimated using Cacti 7 [3] in 22nm technology, the smallest supported technology. The designs are verified in stereo vision art. Tab. III summarizes the results. Total area/power numbers are the sum of 15nm circuitry and 22nm memory elements. Power is estimated at 1GHz. Since Cacti requires widths in multiples of bytes, we estimate a double-buffered 2256-byte LUT (537 and 0.32 mW) and a 64-byte FIFO (215

and 0.18 mW) with 8-bit port, and linearly scale them to target widths of 4 and 6 bits. Overall, modest overheads are introduced when slightly increasing the bit precision from 4 to 8. All designs can run up to 3.3GHz, bounded by the SPU energy computation stage. Increasing the SPU from 4-bit to 6-bit precision while keeping the approximation (“p6a”) incurs 1.09area and 1.07power overheads, but has considerably better statistical robustness. Removing approximation (“p6a”) adds double-buffered LUTs for energy-to-probability conversion, thus incurs 1.20area and 1.10power overheads. Despite a 10% difference in area, we advocate the 6-bit designs without approximation in an ASIC for better sampling quality if area is not a major concern. The benefit from further increasing the bit-precision is marginal based on the previous analysis.

VI. LIMITATIONS AND FUTURE WORK

The proposal of three pillars is an important starting point towards quantifying the statistical robustness of probabilistic accelerators. This work selects the most popular metrics and estimation approaches from many within each pillar. The analysis of other metrics and methods (e.g., MCMC standard error [13]) might help identify limitations of selected metrics. The challenges of naively applying existing metrics motivate us to propose modified processes and a new metric for reporting scalar measures for sampling quality and convergence diagnostic. Our proposals are conceptually straightforward, but could benefit from domain experts developing metrics with stronger theoretical foundations. Ideally, formally proving bounds on the metrics for an accelerator could provide guarantees on statistical robustness, but is extremely difficult or impossible due to many hardware approximations techniques (e.g., truncation to zero). Additionally the adequateness of rule-of-thumb to determine convergence is under debate [52]. Applying the three pillars on other accelerators, applications, and models is our future work. Our proposed processes and metric apply to other accelerators and applications that cannot directly use the existing methods due to high dimensionality and potential random variables with zero variance. The effects of hardware approximations are unknown for applications that require information from variables with very low variance, such as rare event simulation.

VII. RELATED WORK

Sec. II-B discussed various specialized accelerators for probabilistic algorithms. Other accelerators exists for deterministic Bayesian Inference [24], [34]. A benchmark for Bayesian Inference models is proposed for performance evaluation [54]. Previous work addresses some statistical metrics for MCMC accelerators. Mansinghka et al. [38] evaluates data input precision using KL-divergence and QQ plots. Liu et al. [36] argues using ESS/second as a performance metric for MCMC samplers. Multiple goodness of fit statistical tests exist [19], [46]. These metrics all belong to one of three pillars proposed in this work and we argue all three pillars are needed to fully characterize the statistical robustness of an MCMC accelerator. Theoretical studies provide error bounds for MCMC with algorithmic approximation techniques given mathematical assumptions [14], [28]. An analytical tool for quantization error is proposed to help hardware design decision [35], but does not address the statistical robustness. Analytical and empirical studies have been done on evaluating limited precision in neural networks [10], [20], [23], [47].

VIII. CONCLUSION

Domain-specific accelerators require correctness evaluation. In probabilistic algorithms, statistical robustness is an important aspect of correctness defined by domain experts. Current methodologies often omit statistical robustness and thus lack a comprehensive definition of correctness. This work takes a first step toward defining metrics and a framework for evaluating correctness of probabilistic accelerators beyond application end-point result quality. We propose three pillars of statistical robustness: 1) sampling quality, 2) convergence diagnostic, and 3) goodness of fit. We apply the three pillars on an existing hardware accelerator and surface design issues that cannot be revealed by only using application end-point result quality. The three pillars guide the design space exploration and achieve considerable improvements in the statistical robustness by slightly increasing the bit precision. We believe our framework can help architects to design robust probabilistic hardware.

ACKNOWLEDGEMENTS

This project is supported in part by Intel, the Semiconductor Research Corporation and the National Science Foundation (CNS-1616947).

REFERENCES

[1] T. O. Bachir, M. Sawan, and J.-J. Brault, “A new hardware architecture for sampling the exponential distribution,” in Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on. IEEE, 2008, pp. 001 393–001 396.

[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski, “A database and evaluation methodology for optical flow,” International Journal of Computer Vision, vol. 92, no. 1, pp. 1–31, Mar 2011. [Online]. Available: https://doi.org/10.1007/s11263-010-0390-2

[3] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “Cacti 7: New tools for interconnect exploration in innovative off-chip memories,” ACM Transactions on Architecture and Code Optimization (TACO), vol. 14, no. 2, p. 14, 2017.

[4] S. S. Banerjee, Z. T. Kalbarczyk, and R. K. Iyer, “Acmc 2 : Accelerating markov chain monte carlo algorithms for probabilistic models,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: ACM, 2019, pp. 515– 528. [Online]. Available: http://doi.acm.org/10.1145/3297858.3304019

[5] S. T. Barnard, “Stochastic stereo matching over scale,” International Journal of Computer Vision, vol. 3, no. 1, pp. 17–32, May 1989.

[6] S. P. Brooks and A. Gelman, “General methods for monitoring conver- gence of iterative simulations,” Journal of Computational and Graphical Statistics, vol. 7, no. 4, pp. 434–455, 1998.

[7] R. Cai, A. Ren, N. Liu, C. Ding, L. Wang, X. Qian, M. Pedram, and Y. Wang, “Vibnn: Hardware acceleration of bayesian neural networks,” in Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2018, pp. 476–488.

[8] L. N. Chakrapani, B. E. Akgul, S. Cheemalavagu, P. Korkmaz, K. V. Palem, and B. Seshasayee, “Ultra-efficient (embedded) soc architectures based on probabilistic cmos (pcmos) technology,” in Proceedings of the conference on Design, automation and test in Europe: Proceedings. European Design and Automation Association, 2006, pp. 1110–1115.

[9] W. Cheng, L. Ma, T. Yang, J. Liang, and Y. Zhang, “Joint lung ct image segmentation: a hierarchical bayesian approach,” PloS one, vol. 11, no. 9, 2016.

[10] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems 28. Curran Associates, Inc., 2015, pp. 3123–3131. [Online]. Available: http://papers.nips.cc/paper/5647-binaryconnect-training-deep- neural-networks-with-binary-weights-during-propagations.pdf

[11] M. K. Cowles and B. P. Carlin, “Markov chain monte carlo convergence diagnostics: a comparative review,” Journal of the American Statistical Association, vol. 91, no. 434, pp. 883–904, 1996.

[12] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by gibbs sampling,” in Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, 2005, pp. 363–370.

[13] J. M. Flegal, M. Haran, and G. L. Jones, “Markov chain monte carlo: Can we trust the third significant figure?” Statistical Science, pp. 250– 260, 2008.

[14] R. Ge, H. Lee, and A. Risteski, “Simulated tempering langevin monte carlo ii: An improved proof using soft markov chain decomposition,” arXiv preprint arXiv:1812.00793, 2018.

[15] A. Gelman and D. B. Rubin, “Inference from iterative simulation using multiple sequences,” Statistical science, vol. 7, no. 4, pp. 457–472, 1992.

[16] S. Geman and D. Geman, “Stochastic relaxation, gibbs distributions, and the bayesian restoration of images,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 721–741, 1984.

[17] Z. Ghahramani, “Probabilistic machine learning and artificial intelli- gence,” Nature, vol. 521, no. 7553, pp. 452–459, 2015.

[18] L. Gong and J. M. Flegal, “A practical sequential stopping rule for high- dimensional markov chain monte carlo,” Journal of Computational and Graphical Statistics, vol. 25, no. 3, pp. 684–700, 2016.

[19] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Sch¨olkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, no. Mar, pp. 723–773, 2012.

[20] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” CoRR, vol. abs/1502.02551, 2015. [Online]. Available: http://arxiv.org/abs/1502.02551

[21] G. Hamra, R. MacLehose, and D. Richardson, “Markov chain monte carlo: an introduction for epidemiologists,” International journal of epidemiology, vol. 42, no. 2, pp. 627–634, 2013.

[22] J. C. Hedstrom, C. H. Yuen, R. Chen, and B. Farhang-Boroujeny, “Achieving near map performance with an excited markov chain monte carlo mimo detector,” IEEE Transactions on Wireless Communications, vol. 16, no. 12, pp. 7718–7732, Dec 2017.

[23] J. L. Holt and J. . Hwang, “Finite precision error analysis of neural network hardware implementations,” IEEE Transactions on Computers, vol. 42, no. 3, pp. 281–290, March 1993.

[24] S. Hurkat and J. F. Mart´ınez, “Vip: A versatile inference processor,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 345–358.

[25] Intel, “Floating-point ip cores user guide,” 2019. [Online]. Available: https://www.intel.com/content/www/us/en/programmable/ documentation/eis1410764818924.html

[26] Intel, “Intel quartus prime software suite,” 2019. [Online]. Avail- able: https://www.intel.com/content/www/us/en/software/programmable/ quartus-prime/overview.html

[27] G. James, D. Witten, T. Hastie, and R. Tibshirani, An introduction to statistical learning. Springer, 2013, vol. 112.

[28] J. E. Johndrow, J. C. Mattingly, S. Mukherjee, and D. Dunson, “Optimal approximating markov chains for bayesian inference,” arXiv preprint arXiv:1508.03387, 2015.

[29] R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal, “Markov chain monte carlo in practice: a roundtable discussion,” The American Statistician, vol. 52, no. 2, pp. 93–100, 1998.

[30] O. U. Khan and D. D. Wentzloff, “Hardware accelerator for probabilistic inference in 65-nm cmos,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 837–845, 2016.

[31] G. G. Ko and R. A. Rutenbar, “A case study of machine learning hardware: Real-time source separation using markov random fields via sampling-based inference,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 2477–2481.

[32] G. G. Ko and R. A. Rutenbar, “Real-time and low-power streaming source separation using markov random field,” J. Emerg. Technol. Comput. Syst., vol. 14, no. 2, pp. 17:1–17:22, May 2018. [Online]. Available: http://doi.acm.org/10.1145/3183351

[33] J. Lin, “Divergence measures based on the shannon entropy,” IEEE Transactions on Information theory, vol. 37, no. 1, pp. 145–151, 1991.

[34] M. Lin, I. Lebedev, and J. Wawrzynek, “High-throughput bayesian computing machine with reconfigurable hardware,” in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA ’10. New York, NY, USA: ACM, 2010, pp. 73–82. [Online]. Available: http: //doi.acm.org/10.1145/1723112.1723127

[35] M. D. Linderman, M. Ho, D. L. Dill, T. H. Meng, and G. P. Nolan, “Towards program optimization through automated analysis of numerical precision,” in Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, ser. CGO ’10. New York, NY, USA: ACM, 2010, pp. 230–237. [Online]. Available: http://doi.acm.org/10.1145/1772954.1772987

[36] S. Liu, G. Mingas, and C. Bouganis, “An exact mcmc accelerator under custom precision regimes,” in 2015 International Conference on Field Programmable Technology (FPT), Dec 2015, pp. 120–127.

[37] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh, “Tabla: A unified template-based framework for accelerating statistical machine learning,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on. IEEE, 2016, pp. 14–26.

[38] V. Mansinghka and E. Jonas, “Building fast bayesian computing ma- chines out of intentionally stochastic, digital parts,” arXiv preprint arXiv:1402.4914, 2014.

[39] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2, 2001, pp. 416–423 vol.2.

[40] L. Martino, V. Elvira, D. Luengo, J. Corander, and F. Louzada, “Or- thogonal parallel mcmc methods for sampling and optimization,” Digital Signal Processing, vol. 58, pp. 64–84, 2016.

[41] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech, and J. Michelsen, “Open cell library in 15nm freepdk technology,” in Proceedings of the 2015 Symposium on International Symposium on Physical Design, ser. ISPD ’15. New York, NY, USA: ACM, 2015, pp. 171–178. [Online]. Available: http: //doi.acm.org/10.1145/2717764.2717783

[42] M. Matsumoto and T. Nishimura, “Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator,” ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 8, no. 1, pp. 3–30, 1998.

[43] P. McClure, N. Rho, J. A. Lee, J. R. Kaczmarzyk, C. Y. Zheng, S. S. Ghosh, D. M. Nielson, A. G. Thomas, P. Bandettini, and F. Pereira, “Knowing what you know in brain segmentation using bayesian deep neural networks,” Frontiers in Neuroinformatics, vol. 13, p. 67, 2019.

[44] G. Mingas, L. Bottolo, and C.-S. Bouganis, “Particle mcmc algorithms and architectures for accelerating inference in state-space models,” International Journal of Approximate Reasoning, vol. 83, pp. 413–433, 2017.

[45] R. M. Neal, “Mcmc using hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, vol. 2, no. 11, p. 2, 2011.

[46] F. A. Saad, C. E. Freer, N. L. Ackerman, and V. K. Mansinghka, “A family of exact goodness-of-fit tests for high-dimensional discrete distributions,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2019, pp. 1640–1649.

[47] C. Sakr, Y. Kim, and N. Shanbhag, “Analytical guarantees on numerical precision of deep neural networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 3007–3016.

[48] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1-3, pp. 7–42, 2002.

[49] D. B. Thomas and W. Luk, “Using fpga resources for direct genera- tion of multivariate gaussian random numbers,” in Field-Programmable Technology, 2009. FPT 2009. International Conference on. IEEE, 2009, pp. 344–347.

[50] M. B. Thompson, “A comparison of methods for computing autocorre- lation time,” arXiv preprint arXiv:1011.0175, 2010.

[51] D. Vats, J. M. Flegal, and G. L. Jones, “Multivariate output analysis for markov chain monte carlo,” 2015.

[52] D. Vats and C. Knudson, “Revisiting the gelman-rubin diagnostic,” arXiv preprint arXiv:1812.09384, 2018.

[53] S. Wang, X. Zhang, Y. Li, R. Bashizade, S. Yang, C. Dwyer, and A. R. Lebeck, “Accelerating markov random field inference using molecular optical gibbs sampling units,” in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA: IEEE Press, 2016, pp. 558–569.

[54] Y. E. Wang, Y. Zhu, G. G. Ko, B. Reagen, G.-Y. Wei, and D. Brooks, “Demystifying bayesian inference workloads,” in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 177–189.

[55] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient langevin dynamics,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 681–688.

[56] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, “Unsupervised segmentation of natural images via lossy data compression,” Computer Vision and Image Understanding, vol. 110, no. 2, pp. 212 – 225, 2008. [Online]. Available: http://www.sciencedirect.com/science/article/ pii/S1077314207001300

[57] X. Zhang, R. Bashizade, C. LaBoda, C. Dwyer, and A. R. Lebeck, “Ar- chitecting a stochastic computing unit with molecular optical devices,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018, pp. 301–314.