Guiding a Diffusion Model with a Bad Version of Itself

2 months ago·arXiv

Abstract

1 Introduction

Denoising diffusion models [12, 37, 38, 39, 40] generate synthetic images by reversing a stochastic corruption process. Essentially, an image is revealed from pure noise by denoising it little by little in successive steps. A neural network that implements the denoiser (equivalently [44], the score function [15]) is a central design element, and various architectures have been proposed (e.g., [1, 5, 7, 14, 16, 19, 30]). Equally important are the details of the multi-step denoising process that corresponds mathematically to solving an ordinary [27, 38] or a stochastic [40] differential equation, for which many different parameterizations, solvers, and step schedules have been evaluated [17, 18, 22, 35, 46]. To control the output image, the denoiser is typically conditioned on a class label, an embedding of a text prompt, or some other form of conditioning input [28, 33, 36, 45].

The training objective of a diffusion model aims to cover the entire (conditional) data distribution. This causes problems in low-probability regions: The model gets heavily penalized for not representing them, but it does not have enough data to learn to generate good images corresponding to them. Classifier-free guidance (CFG) [13] has become the standard method for “lowering the sampling temperature”, i.e., focusing the generation on well-learned high-probability regions. By training a denoiser network to operate in both conditional and unconditional settings, the sampling process can be steered away from the unconditional result; in effect, the unconditional generation task specifies a result to avoid. This results in better prompt alignment and improved image quality, where the former effect is due to CFG implicitly raising the conditional part of the probability density to a power greater than one [8].

However, CFG has drawbacks that limit its usage as a general low-temperature sampling method. First, it is applicable only to conditional generation, as the guidance signal is based on the difference between conditional and unconditional denoising results. Second, because the unconditional and conditional denoisers are trained to solve different tasks, the sampling trajectory can overshoot the desired conditional distribution, which leads to skewed and often overly simplified image compositions [21]. Finally, the prompt alignment and quality improvement effects cannot be controlled separately, and it remains unclear how exactly they relate to each other.

In this paper, we provide new insights into why CFG improves image quality and show how this effect can be separated out into a novel method that we call autoguidance. Our method does not suffer from the task discrepancy problem because we use an inferior version of the main model itself as the guiding model, with unchanged conditioning. This guiding model can be obtained by simply limiting, e.g., model capacity and/or training time. We validate the effectiveness of autoguidance in various synthetic test cases as well as in practical image synthesis in class-conditional and text-conditional settings. In addition, our method enables guidance for unconditional synthesis. In quantitative tests, the generated image distributions are improved considerably when measured using FID [10] and FD$_\text{DINOv2}$ [42] metrics, setting new records in ImageNet-512 and ImageNet-64 generation.

We will make our implementation and pre-trained models publicly available.

2 Background

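Denoising diffusion. Denoising diffusion generates samples from a distribution $p_\text{data}(\mathbf{x})$ by iteratively denoising a sample of pure white noise, such that a noise-free random data sample is gradually revealed [12]. The idea is to consider heat diffusion of $p_\text{data}(\mathbf{x})$ into a sequence of increasingly smoothed densities $p(\mathbf{x};\sigma) = p_\text{data}(\mathbf{x}) * \mathcal{N}(\mathbf{x};\mathbf{0},\sigma^2\mathbf{I})$. For a large enough $\sigma_\text{max}$, we have $p(\mathbf{x};\sigma_\text{max}) \approx \mathcal{N}(\mathbf{x};\mathbf{0},\sigma_\text{max}^2\mathbf{I})$, from which we can trivially sample by drawing normally distributed white noise. The resulting sample is then evolved backward towards low noise levels by a probability flow ODE [18, 38, 40]

$$\mathrm{d}\mathbf{x}_\sigma = -\sigma \nabla_{\mathbf{x}_\sigma} \log p(\mathbf{x}_\sigma;\sigma)\,\mathrm{d}\sigma \tag{1}$$

that maintains the property $\mathbf{x}_\sigma \sim p(\mathbf{x}_\sigma;\sigma)$ for every $\sigma \in [0,\sigma_\text{max}]$. Upon reaching $\sigma = 0$, we obtain $\mathbf{x}_0 \sim p_\text{data}(\mathbf{x})$ as desired.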
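As a minimal sketch of how such a probability flow ODE can be stepped numerically, consider a 1D toy case that is not part of our experiments: the data distribution is assumed to be a zero-mean Gaussian, so the score of every smoothed density is known in closed form. The function names, the polynomial noise schedule, and its constants (`sigma_max`, `sigma_min`, `rho`) are illustrative assumptions rather than a description of our implementation.

```python
import math


def gaussian_score(x, sigma, s_data=1.0):
    """Score of p(x; sigma) = N(0, s_data^2 + sigma^2), a 1D toy density."""
    return -x / (s_data ** 2 + sigma ** 2)


def noise_schedule(n_steps, sigma_max=80.0, sigma_min=0.002, rho=7.0):
    """Decreasing noise levels from sigma_max to sigma_min, followed by exactly 0."""
    a, b = sigma_max ** (1.0 / rho), sigma_min ** (1.0 / rho)
    sigmas = [(a + i / (n_steps - 1) * (b - a)) ** rho for i in range(n_steps)]
    return sigmas + [0.0]


def solve_probability_flow(x, score_fn, sigmas):
    """Integrate dx = -sigma * score(x, sigma) dsigma with Heun's method."""
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        slope_cur = -s_cur * score_fn(x, s_cur)
        x_euler = x + (s_next - s_cur) * slope_cur
        if s_next > 0.0:
            # Second-order correction: average the slopes at both ends of the step.
            slope_next = -s_next * score_fn(x_euler, s_next)
            x = x + (s_next - s_cur) * 0.5 * (slope_cur + slope_next)
        else:
            x = x_euler  # the final step to sigma = 0 uses plain Euler
    return x


# One hand-picked starting point at sigma_max, evolved down to sigma = 0.
x_final = solve_probability_flow(80.0, gaussian_score, noise_schedule(64))
# Closed-form solution of the ODE for this toy density, for comparison.
x_exact = 80.0 / math.sqrt(1.0 + 80.0 ** 2)
print(x_final, x_exact)
```

For this toy density the ODE has the closed-form solution $x_0 = x_{\sigma_\text{max}} \, s_\text{data} / \sqrt{s_\text{data}^2 + \sigma_\text{max}^2}$, which gives a convenient sanity check for the numerical solver.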

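In practice, the ODE is solved numerically by stepping along the trajectory defined by Equation 1. This requires evaluating the so-called score function [15] $\nabla_\mathbf{x} \log p(\mathbf{x};\sigma)$ for a given sample $\mathbf{x}$ and noise level $\sigma$ at each step. Rather surprisingly, we can approximate this vector using a neural network $D_\theta(\mathbf{x};\sigma)$ parameterized by weights $\theta$ trained for the denoising task

$$\theta = \arg\min_\theta \mathbb{E}_{\mathbf{y}\sim p_\text{data},\,\sigma\sim p_\text{train},\,\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})} \left\| D_\theta(\mathbf{y}+\mathbf{n};\sigma) - \mathbf{y} \right\|_2^2 \tag{2}$$

where $p_\text{train}$ controls the noise level distribution during training. Given $D_\theta$, we can estimate $\nabla_\mathbf{x} \log p(\mathbf{x};\sigma) \approx (D_\theta(\mathbf{x};\sigma) - \mathbf{x})/\sigma^2$, up to approximation errors due to, e.g., finite capacity or training time [18, 44]. As such, we are free to interpret the network as predicting either a denoised sample or a score vector, whichever is more convenient for the analysis at hand. Many reparameterizations and practical ODE solvers are possible, as enumerated by Karras et al. [18]. We follow their recommendations, including the schedule $\sigma(t) = t$ that lets us parameterize the ODE directly via noise level $\sigma$ instead of a separate time variable t.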
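The equivalence between denoisers and scores can be checked by hand in a simple case. The sketch below assumes 1D data drawn from a zero-mean Gaussian with standard deviation `s_data` (a hypothetical toy setting), for which the MSE-optimal denoiser is the posterior mean. Converting its output via $(D(x;\sigma) - x)/\sigma^2$ should reproduce the analytic score of the smoothed density.

```python
def ideal_denoiser(x, sigma, s_data=1.0):
    """MSE-optimal denoiser for data ~ N(0, s_data^2): the posterior mean."""
    return x * s_data ** 2 / (s_data ** 2 + sigma ** 2)


def score_from_denoiser(denoiser, x, sigma):
    """Convert a denoiser output into a score estimate: (D(x; sigma) - x) / sigma^2."""
    return (denoiser(x, sigma) - x) / sigma ** 2


def analytic_score(x, sigma, s_data=1.0):
    """Score of the smoothed density N(0, s_data^2 + sigma^2)."""
    return -x / (s_data ** 2 + sigma ** 2)


# Compare the two routes to the score at a few hand-picked points.
for x in (-2.0, 0.3, 1.7):
    for sigma in (0.05, 1.0, 10.0):
        est = score_from_denoiser(ideal_denoiser, x, sigma)
        print(x, sigma, est, analytic_score(x, sigma))
```

With a trained network in place of `ideal_denoiser`, the same conversion yields only an approximation of the score, which is the source of the errors discussed in the rest of the paper.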

In most applications, each data sample $\mathbf{x}$ is associated with a label $\mathbf{c}$, representing, e.g., a class index or a text prompt. At generation time, we control the outcome by choosing a label $\mathbf{c}$ and seeking a sample from the conditional distribution $p(\mathbf{x}|\mathbf{c};\sigma)$ with $\sigma = 0$. In practice, this is achieved by training a denoiser network $D_\theta(\mathbf{x};\sigma,\mathbf{c})$ that accepts $\mathbf{c}$ as an additional conditioning input.

Classifier-free guidance. For complex visual datasets, the generated images often fail to reproduce the clarity of the training images due to approximation errors made by finite-capacity networks. A broadly used trick called classifier-free guidance (CFG) [13] pushes the samples towards higher likelihood of the class label, sacrificing variety for “more canonical” images that the network appears to be better capable of handling.

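In a general setting, guidance in a diffusion model involves two denoiser networks $D_0(\mathbf{x};\sigma,\mathbf{c})$ and $D_1(\mathbf{x};\sigma,\mathbf{c})$, with implied densities $p_0(\mathbf{x}|\mathbf{c};\sigma)$ and $p_1(\mathbf{x}|\mathbf{c};\sigma)$. The guiding effect is achieved by extrapolating between the two denoising results by a factor w:

$$D_w(\mathbf{x};\sigma,\mathbf{c}) = w D_1(\mathbf{x};\sigma,\mathbf{c}) + (1-w) D_0(\mathbf{x};\sigma,\mathbf{c}) \tag{3}$$

Trivially, setting w = 0 or w = 1 recovers the output of $D_0$ and $D_1$, respectively, while choosing w > 1 over-emphasizes the output of $D_1$. Recalling the equivalence of denoisers and scores [44], we can write

$$D_w(\mathbf{x};\sigma,\mathbf{c}) \approx \mathbf{x} + \sigma^2 \nabla_\mathbf{x} \log \left( p_0(\mathbf{x}|\mathbf{c};\sigma) \left[ \frac{p_1(\mathbf{x}|\mathbf{c};\sigma)}{p_0(\mathbf{x}|\mathbf{c};\sigma)} \right]^w \right) \tag{4}$$

Thus, guidance grants us access to the score of the density $p_w(\mathbf{x}|\mathbf{c};\sigma)$ implied in the parentheses, up to normalization. This score can be further written as [8, 13]

$$\nabla_\mathbf{x} \log p_w(\mathbf{x}|\mathbf{c};\sigma) = \nabla_\mathbf{x} \log p_1(\mathbf{x}|\mathbf{c};\sigma) + (w-1) \nabla_\mathbf{x} \log \frac{p_1(\mathbf{x}|\mathbf{c};\sigma)}{p_0(\mathbf{x}|\mathbf{c};\sigma)} \tag{5}$$

Substituting this expression into the ODE of Equation 1 yields the standard evolution for generating images from $p_1(\mathbf{x}|\mathbf{c};\sigma)$, plus a perturbation that increases (for w > 1) the ratio of $p_1$ and $p_0$ as evaluated at the sample. The latter can be interpreted as increasing the likelihood that a hypothetical classifier would attribute to the sample having come from density $p_1$ rather than $p_0$.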
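The relation between the extrapolated denoiser and the extrapolated score can be sketched in a few lines. The code below treats denoiser outputs and scores as plain lists of floats and uses hypothetical function names; it only illustrates that extrapolating two denoiser outputs by a factor w is the same as adding (w - 1) times the score difference to the score of the main model.

```python
def guided_denoise(d1_out, d0_out, w):
    """Extrapolate between two denoiser outputs: w * D1 + (1 - w) * D0."""
    return [w * a + (1.0 - w) * b for a, b in zip(d1_out, d0_out)]


def guided_score(score1, score0, w):
    """Score of p1 plus (w - 1) times the score of the ratio p1 / p0."""
    return [s1 + (w - 1.0) * (s1 - s0) for s1, s0 in zip(score1, score0)]


# Hand-picked sample, noise level, and score vectors for the two models.
x, sigma, w = [0.5, -2.0], 1.5, 2.5
s1, s0 = [-0.4, 1.0], [-0.1, 0.3]

# Denoiser outputs implied by those scores: D = x + sigma^2 * score.
d1 = [xi + sigma ** 2 * s for xi, s in zip(x, s1)]
d0 = [xi + sigma ** 2 * s for xi, s in zip(x, s0)]

# Both routes should give the same guided denoiser output.
via_denoisers = guided_denoise(d1, d0, w)
via_scores = [xi + sigma ** 2 * s for xi, s in zip(x, guided_score(s1, s0, w))]
print(via_denoisers, via_scores)
```

Setting `w` to 0 or 1 in `guided_denoise` returns the output of the guiding or the main model unchanged, matching the limiting cases noted in the text.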

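In CFG, we train an auxiliary unconditional denoiser $D_\theta(\mathbf{x};\sigma)$ to denoise the distribution $p(\mathbf{x};\sigma)$ marginalized over $\mathbf{c}$, and use this as $D_0$. In practice, this is typically [13] done using the same network with an empty conditioning label, setting $D_0 := D_\theta(\mathbf{x};\sigma,\emptyset)$ and $D_1 := D_\theta(\mathbf{x};\sigma,\mathbf{c})$. By Bayes’ rule, the extrapolated score vector becomes $\nabla_\mathbf{x} \log p(\mathbf{x}|\mathbf{c};\sigma) + (w-1) \nabla_\mathbf{x} \log p(\mathbf{c}|\mathbf{x};\sigma)$. During sampling, this guides the image to more strongly align with the specified class $\mathbf{c}$.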
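The Bayes' rule step can be verified numerically in a small example. The sketch below assumes a hypothetical 1D dataset with two equally likely classes whose noisy densities are Gaussians centered at the values in `MEANS`; `var` stands for the combined data and noise variance. It checks that the difference between the conditional and unconditional scores matches a finite-difference gradient of the log posterior $\log p(c|x;\sigma)$.

```python
import math

MEANS = (-1.0, 1.0)  # hypothetical class means; equal class priors are assumed


def log_normal(x, mean, var):
    """Log density of a 1D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def cond_score(x, c, var):
    """Gradient of log p(x | c; sigma), with var = s_data^2 + sigma^2."""
    return -(x - MEANS[c]) / var


def posterior(x, c, var):
    """p(c | x; sigma) by Bayes' rule with equal priors."""
    likes = [math.exp(log_normal(x, m, var)) for m in MEANS]
    return likes[c] / sum(likes)


def uncond_score(x, var):
    """Gradient of log p(x; sigma) for the class-marginalized mixture."""
    return sum(posterior(x, c, var) * cond_score(x, c, var)
               for c in range(len(MEANS)))


def cfg_score(x, c, var, w):
    """Extrapolated score with a conditional main and an unconditional guide."""
    s1, s0 = cond_score(x, c, var), uncond_score(x, var)
    return s1 + (w - 1.0) * (s1 - s0)


def classifier_grad(x, c, var, eps=1e-5):
    """Central finite-difference gradient of log p(c | x; sigma)."""
    hi = math.log(posterior(x + eps, c, var))
    lo = math.log(posterior(x - eps, c, var))
    return (hi - lo) / (2.0 * eps)


x, c, var = 0.3, 1, 1.25
print(cond_score(x, c, var) - uncond_score(x, var), classifier_grad(x, c, var))
```

In this toy setting the extra term always points towards the side of the chosen class, which is the class-alignment effect of CFG; it says nothing yet about the quality effect analyzed in Section 3.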

It would be tempting to conclude that solving the diffusion ODE with the score function of Equation 5 produces samples from the data distribution specified by $p_w(\mathbf{x}|\mathbf{c};0)$. Unfortunately this is not the case, because $p_w(\mathbf{x}|\mathbf{c};\sigma)$ does not represent a valid heat diffusion of $p_w(\mathbf{x}|\mathbf{c};0)$. Therefore, solving the ODE does not, in fact, follow the density. Instead, the samples are blindly pushed towards higher values of the implied density at each noise level during sampling. This can lead to distorted sampling trajectories, greatly exaggerated truncation, and mode dropping in the results [21], as well as over-saturation of colors [36]. Nonetheless, the improvement in image quality is often remarkable, and high guidance values are commonly used despite the drawbacks (e.g., [11, 31, 33, 36]).

3 Why does CFG improve image quality?

We begin by identifying the mechanism by which CFG improves image quality instead of only affecting prompt alignment. To illustrate why unguided diffusion models often produce unsatisfactory images, and how CFG remedies the problem, we study a 2D toy example where a small-scale denoiser network is trained to perform conditional diffusion in a synthetic dataset (Figure 1). The dataset is designed to exhibit low local dimensionality (i.e., highly anisotropic and narrow support) and hierarchical emergence of local detail upon noise removal. These are both properties that can be expected from the actual manifold of realistic images [4, 32]. For details of the setup, see Appendix C.

Score matching leads to outliers. Compared to sampling directly from the underlying distribution (Figure 1a), the unguided diffusion in Figure 1b produces a large number of extremely unlikely samples outside the bulk of the distribution. In the image generation setting, these would correspond to unrealistic and broken images.

We argue that the outliers stem from the limited capability of the score network combined with the score matching objective. It is well known that maximum likelihood (ML) estimation leads to a “conservative” fit of the data distribution [2] in the sense that the model attempts to cover all training samples. This is because the underlying Kullback–Leibler divergence incurs extreme penalties if the model severely underestimates the likelihood of any training sample. While score matching is generally not equal to ML estimation, they are closely related [12, 24, 40] and appear to exhibit broadly similar behavior. For example, it is known that for a multivariate Gaussian model, the optimal score matching fit coincides with the ML estimate [15]. Figures 2a and 2b show the learned score field and implied density in our toy example for two models of different capacity at an intermediate noise level. The stronger model envelops the data more tightly, while the weaker model’s density is more spread out.
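The "conservative" behavior of a maximum likelihood fit is easy to reproduce when the model is too weak for the data. The sketch below is a toy illustration under stated assumptions: the hand-picked `data` forms two tight clusters, and the model is a single 1D Gaussian. The fit covers both clusters by spreading out, and as a result it assigns its highest density to the empty region between them.

```python
import math


def fit_gaussian_ml(samples):
    """Maximum-likelihood mean and standard deviation of a 1D Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, math.sqrt(var)


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Two tight clusters around -3 and +3; no data anywhere near 0.
data = [-3.1, -3.0, -2.9, 2.9, 3.0, 3.1]
mean, std = fit_gaussian_ml(data)

# The fitted density is higher at 0 (no data) than at 3 (a cluster center).
print(mean, std, gaussian_pdf(0.0, mean, std), gaussian_pdf(3.0, mean, std))
```

Samples drawn from such a fit would frequently land between the clusters, which is the 1D analogue of the outliers in Figure 1b.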

From the perspective of image generation, a tendency to cover the entire training data becomes a problem: The model ends up producing strange and unlikely images from the data distribution’s extremities that are not learnt accurately but included just to avoid the high loss penalties. Furthermore, during training, the network has only seen real noisy images as inputs, and during sampling it may not be prepared to deal with the unlikely samples it is handed down from the higher noise levels.

Figure 1: A fractal-like 2D distribution with two classes indicated with gray and orange regions. Approximately 99% of the probability mass is inside the shown contours. (a) Ground truth samples drawn directly from the orange class distribution. (b) Conditional sampling using a small denoising diffusion model generates outliers. (c) Classifier-free guidance (w = 4) eliminates outliers but reduces diversity by over-emphasizing the class. (d) Naive truncation via lengthening the score vectors. (e) Our method concentrates samples on high-probability regions without reducing diversity.

Figure 2: Closeup of the region highlighted in Figure 1c. (a) The implied learned density (green) at an intermediate noise level and its score vectors (log-gradients), plotted at representative sample points. The learned density approximates the underlying ground truth (orange) but fails to replicate its sharper details. (b) The weaker unconditional model learns a further spread-out density (red) with a looser fit to the data. (c) Guidance moves the points according to the gradient of the (log) ratio of the two learned densities (blue). As the higher-quality model is more sharply concentrated at the data, this field tends inward towards the data distribution. The corresponding gradient is simply the difference of respective gradients in (a) and (b), illustrated at selected points. (d) Sampling trajectories taken by standard unguided diffusion following the learned score $\nabla_\mathbf{x} \log p_1(\mathbf{x}|\mathbf{c};\sigma)$ as the noise level decreases. The contours (orange) represent the ground truth noise-free density. (e) Guidance introduces an additional force shown in (c), causing the points to concentrate at the core of the data density during sampling.

CFG eliminates outliers. The effect of applying classifier-free guidance during generation is demonstrated in Figure 1c. As expected, the samples avoid the class boundary (i.e., there are no samples in the vicinity of the gray area), and entire branches of the distribution are dropped. We also observe a second phenomenon, where the samples have been pulled in towards the core of the manifold, and away from the low-probability intermediate regions. Seeing that this eliminates the unlikely outlier samples, we attribute the image quality improvement to it. However, mere boosting of the class likelihood does not explain this increased concentration.

We argue that this phenomenon stems from a quality difference between the conditional and unconditional denoiser networks. The unconditional denoiser $D_0$ faces the more difficult task of the two: It has to generate from all classes at once, whereas $D_1$ can focus on a single class for any specific sample. Given the more difficult task, and typically only a small slice of the training budget, the network $D_0$ attains a worse fit to the data. This difference in accuracy is apparent in the respective plots of the learned densities in Figures 2a and 2b.

From our interpretation in Section 2, it follows that CFG is not only boosting the likelihood of the sample having come from the class $\mathbf{c}$, but also that of having come from the higher-quality implied distribution. Recall that guidance boils down to an additional force (Equation 5) that pulls the samples towards higher values of $\log [ p_1(\mathbf{x}|\mathbf{c};\sigma) / p_0(\mathbf{x}|\mathbf{c};\sigma) ]$. Plotting this ratio for our toy example in Figure 2c, along with corresponding gradients that guidance contributes to the ODE vector field, we see that the ratio generally decreases with distance from the manifold due to the denominator $p_0$ representing a more spread-out distribution, and hence falling off slower than the numerator $p_1$. Consequently, the gradients point inward towards the data manifold. Each contour of the density ratio corresponds to a specific likelihood that a hypothetical classifier would assign to a sample being drawn from $p_1$ instead of $p_0$. Because the contours roughly follow the local orientation and branching of the data manifold, pushing samples deeper into the “good side” concentrates them at the manifold.
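The inward-pointing gradient of the density ratio can be seen in the simplest possible case. The sketch below assumes two zero-mean 1D Gaussians as stand-ins for the implied densities, with the guiding density (`var0`) more spread out than the main one (`var1`). These are hypothetical stand-ins, not the learned densities of Figure 2.

```python
import math


def log_density(x, var):
    """Log density of a zero-mean 1D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - x * x / (2.0 * var)


def log_ratio(x, var1, var0):
    """log(p1 / p0): main density over the more spread-out guiding density."""
    return log_density(x, var1) - log_density(x, var0)


def guidance_force(x, var1, var0, w):
    """(w - 1) * d/dx log(p1 / p0): the extra term guidance adds to the score."""
    score1 = -x / var1
    score0 = -x / var0
    return (w - 1.0) * (score1 - score0)


var1, var0 = 1.0, 4.0  # the guiding density is wider than the main one
for x in (-2.0, 0.0, 2.0):
    print(x, log_ratio(x, var1, var0), guidance_force(x, var1, var0, w=2.0))
```

Because the wider density falls off more slowly, the log ratio is largest at the center and the extra force always points back towards it, regardless of which side the sample is on.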

Discussion. We can expect the two models to suffer from inability to fit at similar places, but to a different degree. The predictions of the denoisers will disagree more decisively in these regions. As such, CFG can be seen as a form of adaptive truncation that identifies when a sample is likely to be under-fit and pushes it towards the general direction of better samples. Figures 2d and 2e show the effect over the course of generation: The truncation “overshoots” the correction and produces a narrower distribution than the ground truth, but in practice this does not appear to have an adverse effect on the images.

In contrast, a naive attempt at achieving this kind of truncation — inspired by, e.g., the truncation trick in GANs [3, 25] or lowering temperature in generative language models — would counteract the smoothing by uniformly lengthening the score vectors by a factor w > 1. This is illustrated in Figure 1d, where the samples are indeed concentrated in high-probability regions, but in an isotropic fashion that leaves the outer branches empty. In practice, images generated this way tend to show reduced variation, oversimplified details, and monotone texture.
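The difference between uniform lengthening of the score and ratio-based guidance can be illustrated with an anisotropic 2D Gaussian. In the sketch below, axis 0 runs along a hypothetical data branch and axis 1 across it. The guiding model is assumed, purely for illustration, to be more spread out only across the branch; real models need not behave this cleanly.

```python
def score_diag_gaussian(x, variances):
    """Score of a zero-mean Gaussian with diagonal covariance."""
    return [-xi / v for xi, v in zip(x, variances)]


def naive_truncation_score(x, variances, w):
    """Uniformly lengthen the score vector by a factor w."""
    return [w * s for s in score_diag_gaussian(x, variances)]


def ratio_guided_score(x, var_main, var_guide, w):
    """Main score plus (w - 1) times the difference to the guiding score."""
    s1 = score_diag_gaussian(x, var_main)
    s0 = score_diag_gaussian(x, var_guide)
    return [a + (w - 1.0) * (a - b) for a, b in zip(s1, s0)]


var_main = [4.0, 0.01]   # long along the branch, thin across it
var_guide = [4.0, 0.04]  # assumed: looser fit only across the branch
x, w = [2.0, 0.2], 2.0

print(score_diag_gaussian(x, var_main))
print(naive_truncation_score(x, var_main, w))
print(ratio_guided_score(x, var_main, var_guide, w))
```

Uniform lengthening doubles the pull along the branch as well as across it, shrinking the distribution in every direction. The ratio-based force leaves the along-branch component untouched and only strengthens the pull across the branch, where the two models disagree.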

4 Our method

We propose to isolate the image quality improvement effect by directly guiding a high-quality model with a poor model trained on the same task, conditioning, and data distribution, but suffering from certain additional degradations, such as low capacity and/or under-training. We call this procedure autoguidance, as the model is guided with an inferior version of itself.

In the context of our 2D toy example, this turns out to work surprisingly well. Figure 1e demonstrates the effect of using a smaller $D_0$ with fewer training iterations. As desired, the samples are pulled close to the distribution without systematically dropping any part of it.

To analyze why this technique works, recall that under limited model capacity, score matching tends to over-emphasize low-probability (i.e., implausible and under-trained) regions of the data distribution. Exactly where and how the problems appear depends on various factors such as network architecture, dataset, training details, etc., and we cannot expect to identify and characterize the specific issues a priori. However, we can expect a weaker version of the same model to make broadly similar errors in the same regions, only stronger. Autoguidance seeks to identify and reduce the errors made by the stronger model by measuring its difference to the weaker model’s prediction, and boosting it. When the two models agree, the perturbation is insignificant, but when they disagree, the difference indicates the general direction towards better samples.

As such, we can expect autoguidance to work if the two models suffer from degradations that are compatible with each other. Since any $D_1$ can be expected to suffer from, e.g., lack of capacity and lack of training (at least to some degree), it makes sense to choose $D_0$ so that it further exacerbates these aspects.

In practice, models that are trained separately or for a different number of iterations differ not only in accuracy of fit, but also in terms of random initialization, shuffling of the training data, etc. For guidance to be successful, the quality gap should be large enough to make the systematic spreading-out of the density outweigh these random effects.
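The procedure can be sketched end to end in a 1D toy setting. The code below is an illustration under stated assumptions rather than our implementation: both models are replaced by ideal denoisers for zero-mean Gaussians, where the "main" model is slightly too spread out (`s_model=1.2`) and the "guiding" model more so (`s_model=2.0`). The sampler is a plain Euler integrator with a geometric noise schedule, and all names are hypothetical.

```python
def make_gaussian_denoiser(s_model):
    """Ideal denoiser for a model whose implied data density is N(0, s_model^2)."""
    def denoise(x, sigma, label=None):
        # `label` is unused in this toy, but both models receive the same one.
        return x * s_model ** 2 / (s_model ** 2 + sigma ** 2)
    return denoise


def autoguide(main, guide, w):
    """Return a denoiser that extrapolates from `guide` past `main` by factor w."""
    def denoise(x, sigma, label=None):
        d1 = main(x, sigma, label)
        d0 = guide(x, sigma, label)
        return d0 + w * (d1 - d0)
    return denoise


def sample(denoiser, x, label=None, sigma_max=80.0, sigma_min=0.002, n_steps=200):
    """Euler integration of dx/dsigma = (x - D(x; sigma)) / sigma down to sigma = 0."""
    ratio = (sigma_min / sigma_max) ** (1.0 / (n_steps - 1))
    sigmas = [sigma_max * ratio ** i for i in range(n_steps)] + [0.0]
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = x + (s_next - s_cur) * (x - denoiser(x, s_cur, label)) / s_cur
    return x


main = make_gaussian_denoiser(1.2)   # stand-in for the main model
guide = make_gaussian_denoiser(2.0)  # stand-in for a weaker version of it
x_start = 80.0                       # the same initial noise for both runs

print(sample(main, x_start), sample(autoguide(main, guide, 2.0), x_start))
```

Starting from the same noise, the guided run ends closer to the center of the distribution than the unguided one, which mirrors the concentration effect seen in Figure 1e. Setting `w=1.0` reduces `autoguide` to the main model alone.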

Study on synthetic degradations. To validate our hypothesis that the two models must suffer from the same kind of degradations, we perform a controlled experiment using synthetic corruptions applied to a well-trained real-world image diffusion model. We create the main and guiding networks, $D_1$ and $D_0$, by applying different degrees of a synthetic corruption to the base model. This construction allows us to use the untouched base model as grounding when measuring the FID effect of the various combinations of corruptions applied to $D_1$ and $D_0$. We find that as long as the degradations are compatible, autoguidance largely undoes the damage caused by the corruptions:

• Base model: As the base model, we use EDM2-S trained on ImageNet-512 without dropout (FID = 2.56).

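• Dropout: We construct $D_1$ by applying 5% dropout to the base model in a post-hoc fashion; the guiding model $D_0$ is constructed similarly, but with a stronger degree of the same corruption. Applying autoguidance, we reach the best result (FID = 2.55) with w = 2.25, matching the base model’s FID.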

• Input noise: We construct $D_1$ by modifying the base model to add noise to the input images; the noise level conditioning input of the denoiser is adjusted accordingly. The guiding model $D_0$ is constructed similarly, but with a noise level increase of 20% (FID = 9.73). Applying autoguidance, we reach the best result (FID = 2.56) with w = 2.00, again matching the base model’s FID.

• Mismatched degradations: If we corrupt $D_0$ by dropout and $D_1$ by input noise, or vice versa, guidance does not improve the results at all; in these cases, the best FID is obtained by setting w = 1, i.e., by disabling guidance and using the less corrupted $D_1$ exclusively.
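The input-noise corruption in the list above can be sketched as a wrapper around an existing denoiser. This is one plausible realization under stated assumptions, not necessarily the construction used in the experiment: samples are plain lists of floats, `denoiser(x, sigma)` is a hypothetical callable, and `increase` is the relative noise level increase (e.g., 0.2 for 20%). Since independent Gaussian variances add, raising the noise level from $\sigma$ to $(1+k)\sigma$ requires extra noise with standard deviation $\sigma\sqrt{(1+k)^2 - 1}$.

```python
import math
import random


def extra_noise_std(sigma, increase):
    """Std of the added noise that raises noise level sigma to sigma * (1 + increase)."""
    return sigma * math.sqrt((1.0 + increase) ** 2 - 1.0)


def with_input_noise(denoiser, increase, rng):
    """Wrap `denoiser` so that it sees a noisier input and a matching sigma."""
    def degraded(x, sigma):
        std = extra_noise_std(sigma, increase)
        noisy = [xi + rng.gauss(0.0, std) for xi in x]
        # The conditioning input is adjusted to the new, higher noise level.
        return denoiser(noisy, sigma * (1.0 + increase))
    return degraded


def toy_denoiser(x, sigma):
    """Ideal denoiser for unit-variance Gaussian data, used only as a stand-in."""
    return [xi / (1.0 + sigma ** 2) for xi in x]


rng = random.Random(0)
guide = with_input_noise(toy_denoiser, 0.2, rng)
print(guide([0.5, -1.0], 2.0))
```

Using the same wrapper with two different values of `increase` gives a main and a guiding model that share the kind of degradation and differ only in its degree, which is the compatible case described above.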

While this experiment corroborates our main hypothesis, we do not suggest that guiding with these synthetic degradations would be useful in practice. A realistic diffusion model will not suffer from these particular degradations, so creating a guiding model by introducing them would not yield consistent truncation towards the data manifold.

5 Results

Our primary evaluation is carried out using ImageNet (ILSVRC2012) [6] at two resolutions: $512\times512$ and $64\times64$. For ImageNet-512 we use latent diffusion [34], while ImageNet-64 works directly on RGB pixels. We take the current state-of-the-art diffusion model EDM2 [19] as our baseline. We use the EDM2-S and EDM2-XXL models with default sampling parameters: 32 deterministic steps with a 2nd-order Heun sampler [18]. For most setups, a pre-trained model is publicly available, and in the remaining cases we train the models ourselves using the official implementation (Appendix B).

We use two degradations for the guiding model: shorter training time and reduced capacity compared to the main model. We obtain the best results by having both of these enabled. With EDM2-S, for example, we use an XS-sized guiding model that receives only a fraction of the training iterations of the main model. We ablate the relative importance of the degradations as well as the sensitivity to these specific choices in Section 5.1. As the EDM2 networks are known to be sensitive to the guidance weight and EMA length [19], we search for the optimal values in each case using a grid search.
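A grid search of this kind can be sketched as follows. The code is a generic illustration with hypothetical names: `evaluate` stands in for a full metric computation (e.g., generating images and measuring FID), and the grid values in the usage example are placeholders, not the values used in our experiments.

```python
import itertools


def grid_search(evaluate, weights, ema_main, ema_guide):
    """Return (best_score, best_params) minimizing evaluate(w, ema_m, ema_g)."""
    best = None
    for w, em, eg in itertools.product(weights, ema_main, ema_guide):
        score = evaluate(w, em, eg)
        if best is None or score < best[0]:
            best = (score, (w, em, eg))
    return best


def toy_metric(w, ema_m, ema_g):
    """Stand-in for an expensive metric; a smooth bowl with a known minimum."""
    return (w - 2.0) ** 2 + (ema_m - 0.05) ** 2 + (ema_g - 0.1) ** 2


best_score, best_params = grid_search(
    toy_metric,
    weights=[1.0, 1.5, 2.0, 2.5],
    ema_main=[0.025, 0.05, 0.1],
    ema_guide=[0.025, 0.05, 0.1],
)
print(best_score, best_params)
```

The cost grows with the product of the grid sizes, so in practice each evaluation needs to be cheap enough to repeat many times.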

Table 1 shows that our method improves FID [10] and FD$_\text{DINOv2}$ [42] considerably. Using the small model (EDM2-S) in ImageNet-512, our autoguidance improves FID from 2.56 to 1.34. This beats the 1.68 achieved by the concurrently proposed CFG + Guidance Interval [21], and is the best result reported for this dataset regardless of the model size. Using the largest model (EDM2-XXL) further improves the record to 1.25. The FD$_\text{DINOv2}$ records are similarly improved, with the large model lowering the record from 29.16 to 24.18. In ImageNet-64, the improvement is even larger; in this dataset, we set the new record FID and FD$_\text{DINOv2}$ of 1.01 and 31.85, respectively.

Table 1: Results on ImageNet-512 and ImageNet-64. The parameters of autoguidance refer to the capacity and amount of training received by the guiding model. The latter is given relative to the number of training images shown to the main model (T). The columns EMA$_\text{m}$ and EMA$_\text{g}$ indicate the length parameter of the post-hoc EMA technique [19] for the main and guiding model, respectively.

Figure 3: Sensitivity w.r.t. autoguidance parameters, using EDM2-S on ImageNet-512. The shaded regions indicate the min/max FID over 3 evaluations. (a) Sweep over guidance weight w while keeping all other parameters unchanged. The curves correspond to how much the guiding model was trained relative to the number of images shown to the main model. (b) Sweep over guidance weight for different guiding model capacities. (c) Sweep over the two EMA length parameters for our best configuration, which is marked in (a) and (b).

A particular strength of autoguidance is that it can be applied to unconditional models as well. While conditional ImageNet generation may be getting close to saturation, the unconditional results remain surprisingly poor. EDM2-S achieves an FID of 11.67 in the unconditional setting, indicating that practically none of the generated images are of presentable quality. Enabling autoguidance lowers the FID substantially to 3.86, and the improvement in FD$_\text{DINOv2}$ is similarly significant.

5.1 Ablations

Table 1 further shows that it is beneficial to allow independent EMA lengths for the main and guiding models. When both are forced to use the same EMA, FID worsens from 1.34 to 1.53 in ImageNet-512 (EDM2-S). We also measure the effect of each degradation (reduced training time, capacity) in isolation. If we set the guiding model to the same capacity as the main model and only train it for a shorter time, FID worsens to 1.51. If we instead train the reduced-capacity guiding model for as long as the main model, FID suffers a lot more, to 2.13. We can thus conclude that both degradations are beneficial and orthogonal, but a majority of the improvement comes from reduced training of the guiding model. Notably, all these ablations still outperform standard CFG in terms of FID.

Figure 4: Example results for the Tree frog, Palace, Mushroom, Castle classes of ImageNet-512 using EDM2-S. Guidance weight increases to the right; rows are classifier-free guidance and our method.

Figure 3 probes the sensitivity to various hyperparameters. Our best result is obtained by training the guiding model for only a fraction of the main model’s training, in terms of images shown during training. Further halving the training budget is almost equally good, while doubling the amount of training starts to slowly compromise the results. The results are quite insensitive to the choice of the guidance weight. In terms of the capacity of the guiding model, one step smaller (XS for EDM2-S) gave the best result. Two steps smaller (XXS) was also better than no capacity reduction (S), but started to show excessive sensitivity to the guidance weight. The results are also sensitive to the EMA length, similarly to the original EDM2. Post-hoc EMA [19] allows us to search for the optimal parameters at a feasible cost.

We also explored several other degradations for the guiding model but did not find them to be beneficial. First, we tried reducing the amount of training data used for the guiding model, but this did not seem to improve the results over the baseline. Second, applying guidance interval [21] on top of our method reduced its benefits to some extent, suggesting that autoguidance is helpful at all noise levels. Third, deriving the guiding model from the main model using synthetic degradations did not work at all, providing further evidence that the guiding model needs to exhibit the same kinds of degradations that the main model suffers from. Fourth, we found that if the main model had been quantized, e.g., to improve inference speed, quantizing it to an even lower precision did not yield a useful guiding model.

5.2 Qualitative results

Figure 4 shows examples of generated images for ImageNet-512. Both CFG and our method tend to improve the perceptual quality of images, guiding the results towards clearer realizations as the guidance weight increases. However, CFG seems to have a tendency to head towards a more limited number of canonical images [21] per class, while our method produces a wider gamut of image compositions. An example is the atypical image of a Palace at w = 1, which CFG converts to a somewhat idealized depiction as w increases. Sometimes the unguided sample contains incompatible elements of multiple possible images, such as the Castle image, which includes a rough sketch of two or three castles of unrelated styles. In this instance, CFG apparently struggles to decide what to do, whereas our method first builds the large red element into a castle, and with increased guidance focuses on the red foreground object. A higher number of possible output images is consistent with a lower FID, implying better coverage of the training data.

Figure 5: Results for DeepFloyd IF [41] using the prompt “A blue jay standing on a large basket of rainbow macarons”. The rows correspond to different guidance weights. The leftmost column shows results for CFG and the rightmost for autoguidance (XL-sized model guided by M-sized one). The middle columns correspond to blending between the two. See Appendix A for more examples.

In order to study our method in the context of large-scale image generators, we apply it to DeepFloyd IF [41]. We choose this baseline because multiple differently-sized models are publicly available. Ideally we could have also used an earlier snapshot as the guiding model, but those were not available. DeepFloyd IF generates images as a cascade of three diffusion models: a base model and two super-resolution stages. We apply our method to the base model only, while the subsequent stages always use CFG. Figure 5 demonstrates the effect of CFG, our method, and their various combinations. To combine autoguidance with CFG, we extend Equation 3 to cover multiple guiding models as proposed by Liu et al. [23] and distribute the total guidance weight among them using linear interpolation (see Appendix B.2 for details). While CFG improves the image quality significantly, it also simplifies the style and layout of the image towards a canonical depiction. Our method similarly improves the image quality, but it better preserves the image’s style and visual complexity. We hope that using both guiding methods simultaneously will serve as a new, useful artistic control.
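One plausible way to combine the two guiding models is sketched below. It is an assumption-laden illustration and not a reproduction of Appendix B.2: the total guidance strength (w - 1) is split linearly between a CFG direction and an autoguidance direction by a hypothetical blend parameter `alpha`, and denoiser outputs are plain lists of floats.

```python
def combined_guidance(d_main, d_uncond, d_weak, w, alpha):
    """Blend two guidance directions around the main model's output.

    alpha = 0 gives pure CFG (guided by the unconditional output),
    alpha = 1 gives pure autoguidance (guided by the weaker model).
    """
    w_cfg = (1.0 - alpha) * (w - 1.0)
    w_auto = alpha * (w - 1.0)
    return [m + w_cfg * (m - u) + w_auto * (m - k)
            for m, u, k in zip(d_main, d_uncond, d_weak)]


# Hand-picked outputs of the three models at one sample and noise level.
d_main, d_uncond, d_weak = [1.0, -0.5], [0.2, 0.1], [0.8, -0.9]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, combined_guidance(d_main, d_uncond, d_weak, w=3.0, alpha=alpha))
```

At the two extremes the expression reduces to the single-guide extrapolation w * D1 + (1 - w) * D0 with the corresponding guiding model, so intermediate values of `alpha` interpolate between the leftmost and rightmost columns of Figure 5 in spirit.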

6 Discussion and Future work

We have shown that classifier-free guidance entangles several phenomena, and that a different perspective together with simple practical changes opens up an entirely new design space. In addition to removing the superfluous connection to conditioning, this enables significantly better results.

Potential directions for future work include formally proving the conditions that allow autoguidance to be beneficial, and deriving good rules of thumb for selecting the best guiding model. Our suggestion — an early snapshot of a smaller model — is easy to satisfy in principle, but these are not available for current large-scale image generators in practice. Such generators are also often trained in successive stages where the training data may change at some point, causing potential distribution shifts between snapshots that would violate our assumptions.

Acknowledgments

We thank David Luebke, Janne Hellsten, Ming-Yu Liu, and Alex Keller for discussions and comments, and Tero Kuosmanen and Samuel Klenberg for maintaining our compute infrastructure.

References

[1] F. Bao, S. Nie, K. Xue, Y. Cao, C. Li, H. Su, and J. Zhu. All are worth words: A ViT backbone for diffusion models. In Proc. CVPR, 2023.

[2] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, USA, 1995.

[3] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In Proc. ICLR, 2019.

[4] B. C. A. Brown, A. L. Caterini, B. L. Ross, J. C. Cresswell, and G. Loaiza-Ganem. Verifying the union of manifolds hypothesis for image data. In Proc. ICLR, 2023.

[5] K. Crowson, S. A. Baumann, A. Birch, T. M. Abraham, D. Z. Kaplan, and E. Shippole. Scalable high-resolution pixel-space image synthesis with hourglass diffusion transformers. In Proc. ICML, 2024.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009.

[7] P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS, 2021.

[8] S. Dieleman. Guidance: A cheat code for diffusion models. Blog post. https://sander.ai/2022/05/26/guidance.html, 2022.

[9] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). CoRR, abs/1606.08415, 2016.

[10] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. NIPS, 2017.

[11] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans. Imagen Video: High definition video generation with diffusion models. CoRR, abs/2210.02303, 2022.

[12] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020.

[13] J. Ho and T. Salimans. Classifier-free diffusion guidance. In Proc. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.

[14] E. Hoogeboom, J. Heek, and T. Salimans. Simple diffusion: End-to-end diffusion for high resolution images. In Proc. ICML, 2023.

[15] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. JMLR, 6(24):695–709, 2005.

[16] A. Jabri, D. J. Fleet, and T. Chen. Scalable adaptive computation for iterative generation. In Proc. ICML, 2023.

[17] A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas. Gotta go fast when generating data with score-based models. CoRR, abs/2105.14080, 2021.

[18] T. Karras, M. Aittala, T. Aila, and S. Laine. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022.

[19] T. Karras, M. Aittala, J. Lehtinen, J. Hellsten, T. Aila, and S. Laine. Analyzing and improving the training dynamics of diffusion models. In Proc. CVPR, 2024.

[20] D. Kim, Y. Kim, S. J. Kwon, W. Kang, and I.-C. Moon. Refining generative process with discriminator guidance in score-based diffusion models. In Proc. ICML, 2023.

[21] T. Kynkäänniemi, M. Aittala, T. Karras, S. Laine, T. Aila, and J. Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models. CoRR, abs/2404.07724, 2024.

[22] L. Liu, Y. Ren, Z. Lin, and Z. Zhao. Pseudo numerical methods for diffusion models on manifolds. In Proc. ICLR, 2022.

[23] N. Liu, S. Li, Y. Du, A. Torralba, and J. B. Tenenbaum. Compositional visual generation with composable diffusion models. In Proc. ECCV, 2022.

[24] S. Lyu. Interpretation and generalization of score matching. In Proc. UAI, 2009.

[25] M. Marchesi. Megapixel size image creation using generative adversarial networks. CoRR, abs/1706.00082, 2017.

[26] P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 preview – risks and limitations. OpenAI, 2022.

[27] A. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In Proc. ICML, 2021.

[28] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML, 2022.

[29] M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P.-Y. Huang, S.-W. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without supervision. TMLR, 2024.

[30] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proc. ICCV, 2023.

[31] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. In Proc. ICLR, 2023.

[32] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein. The intrinsic dimension of images and its impact on learning. In Proc. ICLR, 2021.

[33] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. CoRR, abs/2204.06125, 2022.

[34] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022.

[35] A. Sabour, S. Fidler, and K. Kreis. Align your steps: Optimizing sampling schedules in diffusion models. In Proc. ICML, 2024.

[36] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Proc. NeurIPS, 2022.

[37] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, 2015.

[38] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021.

[39] Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. NeurIPS, 2019.

[40] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021.

[41] Stability AI. DeepFloyd IF. GitHub repository. https://github.com/deep-floyd/IF, 2023.

[42] G. Stein, J. C. Cresswell, R. Hosseinzadeh, Y. Sui, B. L. Ross, V. Villecroze, Z. Liu, A. L. Caterini, J. E. T. Taylor, and G. Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Proc. NeurIPS, 2023.

[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proc. CVPR, 2016.

[44] P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

[45] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proc. ICCV, 2023.

[46] Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator. In Proc. ICLR, 2023.

Appendices

A Additional results

Figure 6 shows additional results using DeepFloyd IF, similar to Figure 5.

Figure 6: Additional results for DeepFloyd IF [41]. The rows correspond to guidance weights . CFG and our method (XL-sized model guided by M-sized one) on the leftmost and rightmost column, respectively. The middle columns correspond to blending between the two.

B Implementation details

We performed our main experiments on top of the publicly available EDM2 [19] codebase using NVIDIA A100 GPUs, Python 3.11.7, PyTorch 2.2.0, CUDA 11.8, and cuDNN 8.9.7. Since our method only involves using a different guiding model during sampling, we were able to perform all measurements using the existing command-line scripts. We only had to modify one line of code in the sampling loop to pass in the class label to the guiding network in addition to the main network. Algorithm 1 demonstrates the steps needed to reproduce one of our results from Table 1; the rest can be reproduced by repeating the same steps with different models and hyperparameters as indicated in the table. For the four cases where a pre-trained model was unavailable, we trained the models ourselves as detailed in Algorithm 2.

B.1 Hyperparameter search

We optimized the autoguidance parameters for each configuration in Table 1 and Figure 3 using automated grid search. We performed the search separately for FID and $\mathrm{FD}_{\mathrm{DINOv2}}$ across the space of five parameters:

• Model capacity: All available model capacities (EDM2-XXS, EDM2-XS, etc.) up to and including the capacity of the main model.

• Training time: Number of training images in powers of two. We found that the sweet spot was always within the range — there was no need to go for lower or higher values.

• Guidance weight: All values within [1.00, 3.50] at regular intervals of 0.05.

• EMA lengths: All values within [0.010, 0.250] at regular intervals of 0.005. We treated the EMA length of the main model and the guiding model as two separate parameters.

In order to reduce the computational workload, we pruned the search space by considering only a local neighborhood of parameters around the best FID or $\mathrm{FD}_{\mathrm{DINOv2}}$ found thus far. Whenever the result improved, we placed a new local grid around the corresponding parameters, resulting in gradual convergence towards the global optimum. Once we reached the optimum, we further re-evaluated the nearby parameter choices two more times to account for the effect of random noise. As such, each result reported in Table 1 and Figure 3 represents the best of three evaluations. The typical range of random variation is indicated by the shaded regions in Figure 3.
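The pruned search can be pictured as a greedy walk over a discrete parameter grid. The following is a minimal sketch of that idea, not the script used for the experiments: `local_grid_search`, `toy_metric`, the neighborhood radius, and the restriction to two parameters are all hypothetical, and the toy metric stands in for an actual FID evaluation.

```python
import itertools

def local_grid_search(evaluate, grids, start, radius=1):
    """Greedy local search over a discrete grid (lower metric is better).

    grids:    one sorted list of candidate values per parameter
    start:    tuple of starting indices, one per grid
    evaluate: hypothetical callable returning the metric for given values
    """
    cache = {}

    def score(idx):
        # Each grid point is evaluated at most once.
        if idx not in cache:
            cache[idx] = evaluate(*(g[i] for g, i in zip(grids, idx)))
        return cache[idx]

    best = tuple(start)
    improved = True
    while improved:
        improved = False
        # Local neighborhood: all indices within `radius` of the current best.
        ranges = [range(max(0, i - radius), min(len(g), i + radius + 1))
                  for g, i in zip(grids, best)]
        for idx in itertools.product(*ranges):
            if score(idx) < score(best):
                best, improved = idx, True
    values = tuple(g[i] for g, i in zip(grids, best))
    return values, score(best), len(cache)

# Grids taken from the ranges stated in the list above.
weights = [round(1.00 + 0.05 * k, 2) for k in range(51)]        # 1.00 .. 3.50
ema_lengths = [round(0.010 + 0.005 * k, 3) for k in range(49)]  # 0.010 .. 0.250

def toy_metric(w, ema):
    # Hypothetical smooth stand-in for a noisy FID measurement.
    return (w - 2.10) ** 2 + 100.0 * (ema - 0.070) ** 2

best_params, best_score, n_evals = local_grid_search(
    toy_metric, [weights, ema_lengths], start=(0, 0))
print(best_params, n_evals)
```

With a real, noisy metric the walk can stall in a spurious optimum, which is presumably why the nearby parameter choices are re-evaluated after convergence; the sketch omits that step.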

In total, the grid search resulted in roughly 30,000 metric evaluations across all of our configurations. In each metric evaluation, the main cost comes from generating 50,000 random images, which takes around 30–60 minutes using eight A100 GPUs and consumes approximately 2–5 kWh of energy, depending on model size. The overall energy consumption of our entire project was thus in the ballpark of 60–150 MWh.
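The stated energy range follows directly from the per-evaluation figures. A short check of the arithmetic, using only the numbers quoted above; the GPU-hour estimate additionally assumes that every evaluation occupies all eight GPUs for its full duration:

```python
evaluations = 30_000                 # metric evaluations (from the text)
kwh_low, kwh_high = 2, 5             # energy per evaluation, kWh (from the text)
hours_low, hours_high = 0.5, 1.0     # wall-clock time per evaluation (from the text)
gpus = 8

total_mwh_low = evaluations * kwh_low / 1000     # kWh -> MWh
total_mwh_high = evaluations * kwh_high / 1000

# Assumption: all eight GPUs are busy for the whole evaluation.
gpu_hours_low = evaluations * hours_low * gpus
gpu_hours_high = evaluations * hours_high * gpus

print(total_mwh_low, total_mwh_high)    # 60.0 150.0
print(gpu_hours_low, gpu_hours_high)    # 120000.0 240000.0
```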

B.2 DeepFloyd IF experiments

In order to apply CFG and autoguidance simultaneously, we extend Equation 3 to cover multiple guiding models as proposed by Liu et al. [23]. In this case, we have three models: the main model , an unconditional model , and a reduced-capacity conditional model . The guided denoising result is then defined as

where correspond to the guidance weights for CFG and autoguidance, respectively. To interpolate between the two methods, we further define where w indicates the desired total amount of guidance and is a linear interpolation factor.
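One way to write such a two-guide combination is sketched below. It assumes that Equation 3 has the extrapolation form "main output plus (w - 1) times the difference between main and guiding outputs" and that each guiding model contributes its own difference term. The names `w_cfg`, `w_auto`, and `alpha`, as well as the particular interpolation rule, are assumptions chosen so that the two endpoints reduce to pure CFG and pure autoguidance; they are not taken from the text.

```python
def split_weights(w, alpha):
    """Split a total guidance weight w between CFG and autoguidance.

    Assumed rule: alpha = 0 gives pure CFG, alpha = 1 pure autoguidance.
    """
    w_cfg = 1.0 + (w - 1.0) * (1.0 - alpha)
    w_auto = 1.0 + (w - 1.0) * alpha
    return w_cfg, w_auto

def guided_denoise(d_main, d_uncond, d_weak, w_cfg, w_auto):
    """Combine three denoiser outputs elementwise (assumed form).

    d_main:   output of the main conditional model
    d_uncond: output of the unconditional model (CFG guide)
    d_weak:   output of the reduced-capacity conditional model (autoguidance guide)
    """
    return [m + (w_cfg - 1.0) * (m - u) + (w_auto - 1.0) * (m - k)
            for m, u, k in zip(d_main, d_uncond, d_weak)]

# Toy two-pixel "images" to show the two endpoints.
d_main, d_uncond, d_weak = [1.0, 2.0], [0.5, 1.0], [0.0, 3.0]
pure_cfg = guided_denoise(d_main, d_uncond, d_weak, *split_weights(3.0, 0.0))
pure_auto = guided_denoise(d_main, d_uncond, d_weak, *split_weights(3.0, 1.0))
```

Under these assumptions, setting both weights to 1 returns the unguided main model output.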

DeepFloyd IF [41] uses a three-stage cascade: a base model followed by two super-resolution stages. We apply our method only to the base model, while the super-resolution stages always use CFG with their default weights (4 and 9). We use their Stochastic DDPM sampler with default settings: 100, 75, 50 steps for the base model and subsequent super-resolution stages, respectively. Dynamic thresholding [36] is used for all stages.

C Details of the 2D toy example

In this section, we describe the construction of the 2D toy dataset used in the analysis of Section 3, as well as the associated model architecture, training setup, and sampling parameters. We will make all related code publicly available.

Dataset. For each of the two classes c, we model the fractal-like data distribution as a mixture of Gaussians , where , and represent the weight, mean, and covariance matrix of each component i, respectively. This lets us calculate the ground truth scores and probability densities analytically and, consequently, to visualize them without making any additional assumptions. The probability density for a given class is given by

Applying heat diffusion to , we obtain a sequence of increasingly smoothed densities parameterized by noise level

The score function of is then given by
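For a Gaussian mixture these quantities have standard closed forms: the density is the weighted sum of component densities, heat diffusion at noise level sigma adds sigma^2 times the identity to every component covariance, and the score is the density-weighted average of the per-component terms "inverse covariance times (mean minus x)". The sketch below implements these identities for the 2D case in plain Python. The data layout (a list of `(weight, mean, covariance)` tuples) and the function names are assumptions made for illustration.

```python
import math

def gauss2d(x, mu, cov):
    """Density of a 2D Gaussian with symmetric covariance ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Quadratic form (x - mu)^T cov^{-1} (x - mu) via the closed-form 2x2 inverse.
    q = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def smoothed(cov, sigma):
    """Heat diffusion: add sigma^2 * I to the component covariance."""
    (a, b), (c, d) = cov
    return ((a + sigma ** 2, b), (c, d + sigma ** 2))

def density(x, mixture, sigma=0.0):
    """Mixture density at noise level sigma."""
    return sum(w * gauss2d(x, mu, smoothed(cov, sigma)) for w, mu, cov in mixture)

def score(x, mixture, sigma=0.0):
    """Analytic gradient of the log-density with respect to x."""
    gx = gy = 0.0
    for w, mu, cov in mixture:
        s = smoothed(cov, sigma)
        (a, b), (_, d) = s
        det = a * d - b * b
        p = w * gauss2d(x, mu, s)
        dx, dy = mu[0] - x[0], mu[1] - x[1]
        # Accumulate p * cov^{-1} (mu - x).
        gx += p * (d * dx - b * dy) / det
        gy += p * (-b * dx + a * dy) / det
    total = density(x, mixture, sigma)
    return gx / total, gy / total
```

A finite-difference comparison against `math.log(density(...))` is a convenient sanity check for an implementation like this.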

We construct to represent a thin tree-like structure by starting with one main “branch” and recursively subdividing it into smaller ones. Each branch is represented by 8 anisotropic Gaussian components and the subdivision is performed 6 times, decaying after each subdivision and slightly randomizing the lengths and orientations of the two resulting sub-branches. This yields components per class and components in total. We define the coordinate system so that the mean and standard deviation of , marginalized over c, are equal to 0 and along each axis, respectively, matching the recommendations by Karras et al. [18].
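The recursive construction can be sketched as follows. Under the assumption that the branches from every subdivision level are retained, one main branch subdivided 6 times into two sub-branches gives 1 + 2 + ... + 64 = 127 branches, hence 127 × 8 = 1016 components per class and 2032 for two classes. The jitter ranges, the starting geometry, and the choice to attach sub-branches at the end of their parent are hypothetical; the text only states that lengths and orientations are slightly randomized.

```python
import math
import random

def build_branches(start, angle, length, depth, rng, out):
    """Recursively collect branch segments as (start, end) point pairs."""
    end = (start[0] + length * math.cos(angle),
           start[1] + length * math.sin(angle))
    out.append((start, end))
    if depth == 0:
        return
    for sign in (-1.0, 1.0):
        # Hypothetical randomization of orientation and length.
        child_angle = angle + sign * rng.uniform(0.3, 0.6)
        child_length = length * rng.uniform(0.6, 0.8)
        build_branches(end, child_angle, child_length, depth - 1, rng, out)

rng = random.Random(0)
branches = []
build_branches((0.0, 0.0), math.pi / 2, 1.0, depth=6, rng=rng, out=branches)

components_per_branch = 8            # from the text
components_per_class = len(branches) * components_per_branch
components_total = 2 * components_per_class
print(len(branches), components_per_class, components_total)  # 127 1016 2032
```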

Models. We implement the denoiser models as simple multi-layer perceptrons, utilizing the magnitude-preserving design principles from EDM2 [19]. To be able to visualize the implied probability densities in Figure 2, we design the model interface so that for a given noisy sample, each model outputs a single scalar representing the logarithm of the corresponding unnormalized probability density, as opposed to directly outputting the denoised sample or the score vector. Concretely, let us denote the output of a given model by . The corresponding normalized probability density is then given by

By virtue of defining this way, we can derive the score vector, and by extension, the denoised sample, from through automatic differentiation:

Besides Equation 12, we also tried out the alternative formulations where the model outputs the score vector or the denoised sample directly. The results produced by all these variants were qualitatively more or less identical; we chose to go with the formulation above purely for convenience.
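The link between a scalar log-density output, the score, and the denoised sample can be illustrated numerically using two standard identities: the score is the gradient of the log-density with respect to the input (the normalization constant drops out), and the denoised sample equals the noisy input plus sigma^2 times the score. In the sketch below, central finite differences stand in for automatic differentiation so that no deep-learning framework is needed, and `log_density` is a hypothetical stand-in for a trained model.

```python
def grad(f, x, h=1e-5):
    """Central-difference gradient of a scalar function f at point x."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g

def denoise(log_density, x, sigma):
    """Denoised sample from an unnormalized log-density: x + sigma^2 * score."""
    score = grad(log_density, x)
    return [xi + sigma ** 2 * si for xi, si in zip(x, score)]

# Toy case: data ~ N(0, s_data^2 I), so noisy samples ~ N(0, (s_data^2 + sigma^2) I).
s_data, sigma = 0.5, 1.0

def log_density(x):
    # Unnormalized log-density of the noisy distribution.
    return -sum(v * v for v in x) / (2.0 * (s_data ** 2 + sigma ** 2))

x = [0.8, -0.4]
d = denoise(log_density, x, sigma)
# For this Gaussian case the ideal denoiser shrinks x by s_data^2 / (s_data^2 + sigma^2) = 0.2.
```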

To connect the above definition of to the raw network layers, we apply preconditioning using the same general principles as in EDM [18]. Denoting the function represented by the raw network layers as , we define

and the sum is taken over the n output features of . We scale the output of by a learned scaling factor that we initialize to zero.

The goal of Equation 14 is to satisfy the following three requirements:

• The input of should have zero mean and unit magnitude. This is achieved through the division by

• After initialization, should represent the best possible first-order approximation of the correct solution. This is achieved through the term, as well as the fact that after initialization.

• After training, should have approximately unit magnitude. This is achieved through the division by

In practice, we use an MLP with one input layer and four hidden layers, interspersed with SiLU [9] activation functions and implemented using the magnitude-preserving primitives from EDM2 [19]. The input is a 5-dimensional vectorfor the orange class, the gray class, and 0 for the unconditional case. The output of each hidden layer has n features, where

Training. Given that we have the exact score function of the ground truth distribution readily available (Equation 9), we train the models using exact score matching [15] for simplicity and increased robustness. For a given class c, we thus define the loss function as

where is realized as [18]. As an alternative to exact score matching, we also experimented with the more commonly used denoising score matching, but did not observe any noticeable differences in model behavior or training dynamics.

We train for 4096 iterations and for 512 iterations using a batch size of 4096 samples. In terms of hyperparameters, we set rate decay schedule with and iterations, along with a power function EMA profile [19] with . Overall, the setup is robust with respect to the hyperparameters; the phenomena illustrated in Figures 1 and 2 remain unchanged across a wide range of parameter choices.

Sampling. We use the standard EDM sampler [18] with N = 32 Heun steps (NFE = 63), , and . We chose the values of N and to be much higher than what is actually needed for this dataset in order to avoid potential discretization errors from affecting our conclusions. In Figure 1, we set w = 4 for CFG and w = 3 for autoguidance, and multiply the score vectors (Equation 12) by 1.40 for naive truncation. In Figure 2, we set
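For orientation, the noise-level schedule of the EDM sampler [18] and the relation between N and NFE can be sketched as below. The schedule formula is the one published with EDM; the numeric values passed for `sigma_min`, `sigma_max`, and `rho` are EDM's image-domain defaults used purely as placeholders and are not the values used for the toy example.

```python
def edm_sigmas(n, sigma_min, sigma_max, rho):
    """EDM noise levels sigma_0 > ... > sigma_{n-1}, followed by a final 0."""
    a = sigma_max ** (1.0 / rho)
    b = sigma_min ** (1.0 / rho)
    sigmas = [(a + i / (n - 1) * (b - a)) ** rho for i in range(n)]
    return sigmas + [0.0]

def heun_nfe(n):
    """Network evaluations for n Heun steps.

    Each step uses two evaluations, except the last step to sigma = 0,
    which falls back to a single Euler step.
    """
    return 2 * n - 1

# Placeholder parameters (assumptions, see the note above).
sigmas = edm_sigmas(32, sigma_min=0.002, sigma_max=80.0, rho=7.0)
print(len(sigmas), heun_nfe(32))  # 33 63
```

The count 2N - 1 reproduces the NFE = 63 quoted for N = 32.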

D Broader societal impact

Generative modeling, including that of images and videos, has significant potential for misuse. It can have negative consequences for society in several ways. The primary concerns include various types of disinformation, but also the potential to amplify stereotypes and unwanted biases [26]. Our improvements to sample quality can make the results even more believable, even when used for disinformation. That said, we do not unlock any novel uses of the technology.

E Licenses

• EDM2 models [19]: Creative Commons BY-NC-SA 4.0 license

• DeepFloyd IF models [41]: Modified MIT license

• Stable Diffusion VAE model [34]: CreativeML Open RAIL++-M license

• InceptionV3 model [43]: Apache 2.0 license

• DINOv2 model [29]: Apache 2.0 license

• ImageNet dataset [6]: Custom non-commercial license
