Guiding a Diffusion Model with a Bad Version of Itself (Autoguidance)

hozy Summary

  • Classifier Free Guidance
    • cf.) Refer to the previous note on CFG for more details.
    • Strength)
      • Improves the image quality
        • Score-based models tend to produce outlier outputs and CFG eliminates those outliers.
    • Drawbacks)
      • Limits its usage as a general low-temperature sampling model.
      • Applicable only for conditional generation
      • Sampling trajectory can overshoot the desired conditional distribution.
        • Result)
          • Skewed and often overly simplified image compositions.
      • The prompt alignment and quality improvement effects cannot be controlled separately.
        • Unclear how exactly they are related to each other.
  • Autoguidance
    • Key Idea)
      • Use the same framework as CFG.
      • Generalize \(D_0\) into a weaker version of \(D_1\).
        • cf.)
          • In CFG, \(D_0\) was the unconditional version of \(D_1\), which carried the penalty of having to marginalize over all class conditions.
          • Here, \(D_0\) is generalized to any weaker version of \(D_1\) that is handicapped in training, so that it underfits the data distribution.
      • Sampling is guided by two terms:
        • the score of the original model \(\nabla_\mathbf{x}\log p_1(\mathbf{x}\mid\mathbf{c};\sigma)\), which raises the log-likelihood under \(p_1\).
        • the gradient of the log-ratio between the original and the weaker models’ densities \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\), which pushes samples toward regions where \(p_1\) is high relative to \(p_0\).
    • Ideation)
      • Recall that the score matching objective tended to over-emphasize low probability regions of the data distribution.
      • This may be due to various factors such as network architecture.
      • The problem is that we cannot expect to identify and characterize the specific issues a priori.
      • If we set up an additional, weaker version of the same model, it should make similar errors in the same regions, only larger.
        • \(D_1\) : the high-quality model
        • \(D_0\) : the poor-quality model
          • trained on the same task, conditioning, and data distribution as \(D_1\)
          • suffers from certain additional degradations
            • e.g.) low capacity, under-training
      • By measuring the difference between the predictions of the original model and the weaker one, we may identify where the errors are made.


2. Background

Concept) Denoising Diffusion

  • Problem Setting)
    • \(p_{\text{data}}(\mathbf{x})\). : the data distribution
    • \(p(\mathbf{x};\sigma) = p_{\text{data}}(\mathbf{x}) * \mathcal{N}(\mathbf{x};\; \mathbf{0}, \sigma^2\mathbf{I})\). : a sequence of increasingly smoothed densities
      • where
        • \(\sigma\in[0, \sigma_{\max}]\). : the continuous noise level
          • Scheduled with \(\sigma(t)=t\). in this paper.
        • \(p(\mathbf{x};\sigma_{\max}) \approx \mathcal{N}(\mathbf{x};\; \mathbf{0}, \sigma_{\max}^2\mathbf{I})\). : pure noise!
    • \(\text{d}\mathbf{x}_{\sigma} = -\sigma \nabla_{\mathbf{x}_{\sigma}} \log p(\mathbf{x}_{\sigma}; \sigma) \text{d}\sigma\). : the probability flow ODE
      • where
        • \(\mathbf{x}_{\sigma}\sim p (\mathbf{x}_{\sigma};\sigma),\quad\forall\sigma\in[0, \sigma_{\max}]\).
        • \(\mathbf{x}_0 \sim p (\mathbf{x}_0; 0) = p_{\text{data}}(\mathbf{x}_0)\). : the denoised sample!
      • Meaning)
        • As the noise level decreases (\(\text{d}(-\sigma)\)), the sample \(\mathbf{x}_\sigma\) moves in the direction of the score, i.e. toward higher log-density.
  • Sol.)
    • The score function \(\nabla_{\mathbf{x}}\log p(\mathbf{x};\sigma)\), which is needed at every sample \(\mathbf{x}\) and noise level \(\sigma\) to trace the trajectory of \(\mathbf{x}_\sigma\), is intractable.
    • Instead, we may set up a neural network \(D_\theta(\mathbf{x};\sigma)\)., and train for the denoising task of
      • \(\theta = \displaystyle\arg\min_{\theta} \mathbb{E}_{\mathbf{y}\sim p_{\text{data}}, \sigma\sim p_{\text{train}}, \mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})} \big\Vert D_\theta(\mathbf{y+n};\;\sigma) - \mathbf{y} \big\Vert_2^2\).
        • where
          • \(\mathbf{y}\sim p_{\text{data}}\). is the clean data
            • cf.) \(\mathbf{x} = \mathbf{y + n},\quad\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})\).
          • \(p_{\text{train}}\). controls the noise level distribution during training
            • e.g.) Uniform
        • Prop.)
          • \(\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma) \approx \displaystyle\frac{D_\theta(\mathbf{x};\;\sigma)-\mathbf{x}}{\sigma^2}\).
  • Conditional Case)
    • Let \(\mathbf{c}\). be the label.
    • Given the label \(\mathbf{c}\)., we seek a sample from the conditional distribution \(p(\mathbf{x}\mid\mathbf{c};\;\sigma)\).
    • The denoiser network can be denoted as \(D_\theta(\mathbf{x};\sigma,\mathbf{c})\).
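
The denoiser-to-score relation and the probability flow ODE above can be sketched on a toy problem. This is a minimal sketch, not the paper's setup: it assumes a 1-D Gaussian data distribution \(p_{\text{data}} = \mathcal{N}(0, s^2)\), for which the optimal denoiser has a closed form, so no network is trained. The names `S_DATA`, `ideal_denoiser`, `score_from_denoiser`, and `sample_pf_ode`, as well as the geometric noise schedule, are illustrative assumptions.

```python
import numpy as np

S_DATA = 0.5  # assumed std of the toy 1-D Gaussian data distribution


def ideal_denoiser(x, sigma, s=S_DATA):
    """Closed-form minimizer of the denoising loss when p_data = N(0, s^2)."""
    return x * s**2 / (s**2 + sigma**2)


def score_from_denoiser(x, sigma, denoiser=ideal_denoiser):
    """Score estimate (D(x; sigma) - x) / sigma^2."""
    return (denoiser(x, sigma) - x) / sigma**2


def sample_pf_ode(n_samples, sigma_max=80.0, sigma_min=0.002, n_steps=256, seed=0):
    """Euler integration of dx = -sigma * score(x; sigma) d(sigma), from sigma_max down to 0."""
    rng = np.random.default_rng(seed)
    # Assumed schedule: geometric spacing, with a final step to sigma = 0.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    x = rng.normal(0.0, sigma_max, size=n_samples)  # approximately pure noise
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        dx_dsigma = -sigma_cur * score_from_denoiser(x, sigma_cur)
        x = x + (sigma_next - sigma_cur) * dx_dsigma
    return x


samples = sample_pf_ode(20_000)
print(f"sample std: {samples.std():.3f} (data std: {S_DATA})")
```

If the relation holds, the standard deviation of the generated samples should land close to `S_DATA`, up to Euler discretization and Monte Carlo error.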


Concept) Classifier-Free Guidance (CFG)

  • Purpose)
    • Push the samples toward higher likelihood of the class label.
      • sacrificing the variety of generation
      • focusing on more canonical images that the network appears to be better capable of handling
    • To generate low temperature samples.
      • Why needed?)
        • The training objective of a DM aims to cover the entire data distribution.
        • Thus, for the low-probability regions, the model gets heavily penalized for not representing them.
        • However, the model does not have enough data to learn to generate good images corresponding to them.
        • As a result, low quality images are generated.
  • Methodology)
    • Train a denoiser network to operate in both conditional and unconditional setting.
      • \(D_0(\mathbf{x};\sigma)\) : the unconditional denoiser
      • \(D_1(\mathbf{x};\sigma,\mathbf{c})\) : the conditional denoiser
    • The unconditional generation task specifies a result to avoid.
  • Model)
    • General settings)
      • \(D_w(\mathbf{x};\sigma,\mathbf{c}) = w D_1(\mathbf{x};\sigma,\mathbf{c}) + (1-w)D_0(\mathbf{x};\sigma,\mathbf{c})\).
        • Choosing \(w\gt1\)., we may over-emphasize the output of \(D_1\)..
      • Recall that the denoiser and the score function were equivalent as
        • \(\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma) \approx \displaystyle\frac{D_\theta(\mathbf{x};\;\sigma)-\mathbf{x}}{\sigma^2}\).
      • Adding the condition \(\mathbf{c}\) to the density, we may rewrite this as
        \(\begin{aligned} D_w(\mathbf{x};\sigma,\mathbf{c}) &\approx \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log p_w(\mathbf{x}\mid\mathbf{c}; \sigma) & \cdots(A) \\ &\approx \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log \Big( p_1(\mathbf{x}\mid\mathbf{c};\sigma)^w \cdot p_0(\mathbf{x}\mid\mathbf{c};\sigma)^{1-w} \Big) \\ &= \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log \left( p_1(\mathbf{x}\mid\mathbf{c};\sigma) \cdot \left[\frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\right]^{w-1} \right) & \cdots(B) \end{aligned}\)
      • From (A) and (B), we may get
        • \(\displaystyle\nabla_{\mathbf{x}} \log p_w(\mathbf{x}\mid\mathbf{c}; \sigma) = \nabla_{\mathbf{x}} \log p_1(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\).
      • Then considering the probability flow ODE, we have
        \(\begin{aligned} \text{d}\mathbf{x}_{\sigma} &= -\sigma \nabla_{\mathbf{x}_{\sigma}} \log p(\mathbf{x}_{\sigma}; \sigma) \text{d}\sigma \\ &= -\sigma \left( \underbrace{\nabla_{\mathbf{x}} \log p_1(\mathbf{x}\mid\mathbf{c};\sigma)}_{\text{generation by }p_1} + (w-1) \underbrace{\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}}_{\text{perturbation}} \right) \text{d}\sigma \\ \end{aligned}\).
        • Interpretation)
          • The image generation consists of
            • the generation by the density \(p_1\).
            • the perturbation
              • i.e.) the increase in the likelihood that a hypothetical classifier would assign to the sample having come from density \(p_1\) rather than \(p_0\).
              • i.e.) the direction that makes \(p_1\) and \(p_0\) easier to tell apart
              • Larger \(w\)., stronger effect
    • Classifier Free Guidance)
      • Let \(D_0\). be unconditional : \(\mathbf{c}=\varnothing\).
        • i.e.) \(D_0(\mathbf{x};\sigma,\varnothing)\) with \(p_0(\mathbf{x};\sigma)\).
      • Put \(D_1 = D_\theta\)..
      • Then the above dynamics become
        \(\begin{aligned} \nabla_{\mathbf{x}} \log p_w(\mathbf{x}\mid\mathbf{c}; \sigma) &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p_\theta(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x};\sigma)} \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p(\mathbf{c}\mid\mathbf{x};\sigma)}{p(\mathbf{c};\sigma)} \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c}\mid\mathbf{x};\sigma) - (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c};\sigma) \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c}\mid\mathbf{x};\sigma) & (\because \nabla_{\mathbf{x}}\log p(\mathbf{c};\sigma)=0) \\ \end{aligned}\)
      • From the above, we may get
        • \(p_w(\mathbf{x}\mid\mathbf{c}; \sigma) \varpropto p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) \cdot \underbrace{p(\mathbf{c}\mid\mathbf{x};\sigma)^{w-1}}_{\text{implied density}}\)
      • This family of densities differs from a valid heat diffusion of the data, which would add noise as
        • \(p(\mathbf{x}\mid\mathbf{c}; \sigma) = p_{\text{data}}(\mathbf{x}\mid\mathbf{c}) * \mathcal{N}(\mathbf{x};\; \mathbf{0},\sigma^2\mathbf{I})\)
      • This paper argues that this leads to
        • distorted sampling trajectories
        • exaggerated truncation
        • mode dropping in results
        • oversaturation of colors
  • Strength)
    • Improves the image quality
  • Drawbacks)
    • Limits its usage as a general low-temperature sampling model.
    • Applicable only for conditional generation
    • Sampling trajectory can overshoot the desired conditional distribution.
      • Result)
        • Skewed and often overly simplified image compositions.
    • The prompt alignment and quality improvement effects cannot be controlled separately.
      • Unclear how exactly they are related to each other.
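
The guided denoiser from the Model) item above can be checked numerically. The sketch below assumes two closed-form Gaussian denoisers as hypothetical stand-ins for \(D_1\) and \(D_0\) (in real CFG both come from one trained network), and verifies that the score implied by \(D_w = wD_1 + (1-w)D_0\) equals the score of \(p_1\) plus \((w-1)\) times the gradient of \(\log(p_1/p_0)\). The helper names are illustrative.

```python
import numpy as np


def gaussian_denoiser(x, sigma, s):
    """Optimal denoiser for 1-D data N(0, s^2); a stand-in for a trained network."""
    return x * s**2 / (s**2 + sigma**2)


def to_score(denoised, x, sigma):
    """Convert a denoiser output into a score estimate: (D - x) / sigma^2."""
    return (denoised - x) / sigma**2


def guided_denoiser(d1, d0, w):
    """D_w = w * D_1 + (1 - w) * D_0."""
    return w * d1 + (1.0 - w) * d0


x = np.linspace(-3.0, 3.0, 7)
sigma, w = 1.5, 4.0
d1 = gaussian_denoiser(x, sigma, s=0.5)  # hypothetical stand-in for D_1
d0 = gaussian_denoiser(x, sigma, s=2.0)  # hypothetical stand-in for D_0

score_1 = to_score(d1, x, sigma)
score_0 = to_score(d0, x, sigma)
score_w = to_score(guided_denoiser(d1, d0, w), x, sigma)

# score_1 - score_0 is the gradient of log(p_1 / p_0), i.e. the perturbation term.
perturbation = score_1 - score_0
assert np.allclose(score_w, score_1 + (w - 1.0) * perturbation)
print("guided score matches score_1 + (w - 1) * perturbation")
```

Because the identity is linear in the two denoiser outputs, it holds for any pair of denoisers, not only the Gaussian ones assumed here.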


3. Why does CFG improve image quality?

  • Summary)
    • Score-based models tend to produce outlier outputs.
      • Why?)
        • The score matching objective is closely related to the ML estimation.
        • The ML estimation that utilizes the KL-Divergence leads to a conservative fit of the data distribution.
          • i.e.) Attempting to cover all training samples by giving extreme penalties to the model if it severely underestimates the likelihood of any training example.
        • The conservative fit to data distribution results in generating strange and unlikely images from the distribution’s extreme.
          • cf.) They are not learnt accurately but included just to avoid the high loss penalty.
    • CFG eliminates the outliers.
      • How?)
        • The unconditional denoiser \(D_0\). underfits the data and results in alleviating the conservative fit of the score-based model.
          • Why?)
            • Recall that the CFG had two denoiser networks
              • \(D_0(\mathbf{x};\sigma)\). : the unconditional denoiser
              • \(D_1(\mathbf{x};\sigma,\mathbf{c})\). : the conditional denoiser
            • \(D_0\) faces a more difficult task in that it has to generate from all classes at once.
              • Whereas \(D_1\) can focus on a single class \(\mathbf{c}\).
            • Given only a small slice of the training budget, \(D_0\) attains a worse fit to the data.
        • The perturbation term \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x};\sigma)}\) concentrates the samples on the “good side” (where the training data is rich) of the data manifold.
          • Why?)
            • The unconditionally learned distribution \(p_0\) is more spread out than the conditional one \(p_1\).
            • Thus, the ratio \(\displaystyle\frac{p_1}{p_0}\) decreases sharply with distance from the manifold.
            • Thus, the gradient \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x};\sigma)}\) points inward, toward the data manifold.
            • This leads the model to sample more within the data manifold.
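
The "points inward" argument can be illustrated in one dimension. This is a hedged sketch that assumes both learned densities are zero-mean Gaussians, with the stand-in for \(p_0\) more spread out than the stand-in for \(p_1\); the "manifold" is then simply the point \(x = 0\). The function names and the widths `s_tight` and `s_wide` are assumptions for illustration.

```python
import numpy as np


def gaussian_score(x, sigma, s):
    """Score of N(0, s^2) smoothed with noise level sigma, i.e. of N(0, s^2 + sigma^2)."""
    return -x / (s**2 + sigma**2)


def log_ratio_gradient(x, sigma, s_tight=0.5, s_wide=2.0):
    """Gradient of log(p_1 / p_0) with p_1 the tighter and p_0 the wider density."""
    return gaussian_score(x, sigma, s_tight) - gaussian_score(x, sigma, s_wide)


x = np.array([-2.0, -0.5, 0.5, 2.0])
grad = log_ratio_gradient(x, sigma=0.5)

# The gradient always has the opposite sign of x: it points back toward x = 0.
for xi, gi in zip(x, grad):
    print(f"x = {xi:+.1f}  ->  gradient of log ratio = {gi:+.3f}")
```

The magnitude also grows with distance from the center, so samples far from the high-density region receive the strongest correction.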


Model) 2D Toy Example

  • Desc.)
    • Dataset is designed to exhibit…
      • Low Local Dimensionality for the zero noise case (\(\sigma = 0\).)
        • highly anisotropic
          • The data distribution is not uniform in all directions
          • It is stretched out along certain meaningful axes
          • Extremely thin in other, nonsensical directions
        • narrow support
          • “Support” refers to the region where data actually exists.
          • Even if you move along a meaningful “anisotropic” direction, straying even slightly from the path will result in an unrealistic image.
            • Why?) No data exists in that region!
      • Hierarchical emergence of local detail upon noise removal
        • As noise decreases (\(\sigma\rightarrow0\).)…
          • Isotropic distribution becomes anisotropic.
          • Support gets narrower.
  • Result) Sample Distribution
    • (a) : the underlying distribution
    • (b) : Sampling with no guidance.
      • Unlikely outlier samples are generated
    • (c) : CFG
      • Overshoots the correction and produces a narrower distribution than the ground truth.
      • However, this overshooting does not appear to have an adverse effect on images.
    • (d) : Naive Truncation
      • e.g.)
        • GAN’s truncation techniques
        • Lowering temperature in generative language models
      • Desc.)
        • Concentrated in high-probability regions
        • But, isotropic with low diversity, oversimplified details, and monotone texture
    • (e) : Autoguidance
  • Result) Gradient of each probability distribution
    • (a) : Distribution learned by the conditional denoiser \(D_1(\mathbf{x};\sigma,\mathbf{c})\) at an intermediate noise level \(\sigma_{\text{mid}}\)
      • Learned density approximates the underlying ground truth quite well
      • But, it fails to replicate its sharper details.
    • (b) : Distribution learned by the unconditional denoiser \(D_0(\mathbf{x};\sigma)\) at an intermediate noise level \(\sigma_{\text{mid}}\)
      • Learned a further spread-out density compared to (a)
      • Looser fit to the data
    • (c) : The perturbation term
    • (d) : DM with no guidance
    • (e) : CFG with \(w=4\).
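
The "hierarchical emergence" property listed under Desc.) can be made concrete with a much simpler stand-in than the paper's toy dataset. The sketch below assumes a single 2-D Gaussian that is long along one axis and very thin along the other. Smoothing it with noise level \(\sigma\) gives covariance \(\Sigma + \sigma^2\mathbf{I}\), and the ratio of largest to smallest eigenvalue serves as a rough anisotropy measure. `data_cov` and `anisotropy` are hypothetical names.

```python
import numpy as np

# Assumed toy covariance: stretched along the first axis, extremely thin along the second.
data_cov = np.diag([1.0, 1e-4])


def anisotropy(sigma, cov=data_cov):
    """Largest / smallest eigenvalue of the noise-smoothed covariance cov + sigma^2 * I."""
    smoothed = cov + sigma**2 * np.eye(cov.shape[0])
    eigvals = np.linalg.eigvalsh(smoothed)  # returned in ascending order
    return eigvals[-1] / eigvals[0]


for sigma in [10.0, 1.0, 0.1, 0.01, 0.0]:
    print(f"sigma = {sigma:5.2f}   anisotropy = {anisotropy(sigma):10.2f}")
```

At high noise the ratio stays near 1 (nearly isotropic), and it grows as \(\sigma\rightarrow0\), which mirrors the transition from isotropic to anisotropic described above.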



4. Our Method (Autoguidance)

  • Key Idea)
    • Use the same framework as CFG.
    • Generalize \(D_0\) into a weaker version of \(D_1\).
      • cf.)
        • In CFG, \(D_0\) was the unconditional version of \(D_1\), which carried the penalty of having to marginalize over all class conditions.
        • Here, \(D_0\) is generalized to any weaker version of \(D_1\) that is handicapped in training, so that it underfits the data distribution.
    • Sampling is guided by two terms:
      • the score of the original model \(\nabla_\mathbf{x}\log p_1(\mathbf{x}\mid\mathbf{c};\sigma)\), which raises the log-likelihood under \(p_1\).
      • the gradient of the log-ratio between the original and the weaker models’ densities \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\), which pushes samples toward regions where \(p_1\) is high relative to \(p_0\).
  • Ideation)
    • Recall that the score matching objective tended to over-emphasize low probability regions of the data distribution.
    • This may be due to various factors such as network architecture.
    • The problem is that we cannot expect to identify and characterize the specific issues a priori.
    • If we set up an additional, weaker version of the same model, it should make similar errors in the same regions, only larger.
      • \(D_1\) : the high-quality model
      • \(D_0\) : the poor-quality model
        • trained on the same task, conditioning, and data distribution as \(D_1\)
        • suffers from certain additional degradations
          • e.g.) low capacity, under-training
    • By measuring the difference between the predictions of the original model and the weaker one, we may identify where the errors are made.
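
The sampling procedure implied by the Key Idea) can be sketched as follows. This is a toy illustration under loud assumptions, not the paper's implementation: both denoisers are closed-form 1-D Gaussian denoisers, the "weaker" model is emulated by a wider fitted distribution (`s_weak > s_main`), and the sampler is a plain Euler solver on an assumed geometric noise schedule. All function and parameter names are hypothetical.

```python
import numpy as np


def gaussian_denoiser(x, sigma, s):
    """Optimal denoiser for 1-D data N(0, s^2); stands in for a trained network."""
    return x * s**2 / (s**2 + sigma**2)


def autoguided_sample(n_samples, w, s_main=0.5, s_weak=0.8,
                      sigma_max=80.0, sigma_min=0.002, n_steps=256, seed=0):
    """Euler sampling of the probability flow ODE with D_w = w * D_1 + (1 - w) * D_0."""
    rng = np.random.default_rng(seed)
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, n_steps), 0.0)
    x = rng.normal(0.0, sigma_max, size=n_samples)
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d1 = gaussian_denoiser(x, sigma_cur, s_main)  # main model D_1
        d0 = gaussian_denoiser(x, sigma_cur, s_weak)  # weaker model D_0 (assumed wider fit)
        d_w = w * d1 + (1.0 - w) * d0
        # dx/dsigma = -sigma * score = (x - D_w) / sigma
        x = x + (sigma_next - sigma_cur) * (x - d_w) / sigma_cur
    return x


unguided = autoguided_sample(20_000, w=1.0)
guided = autoguided_sample(20_000, w=2.0)
print(f"std without guidance (w=1): {unguided.std():.3f}")
print(f"std with autoguidance (w=2): {guided.std():.3f}")
```

In this toy setting, guidance with \(w>1\) pulls samples toward the region where the main model assigns high density relative to the weaker one. It only illustrates the direction of the effect; a single Gaussian has no outliers to remove, so it says nothing about image quality.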


Concept) Synthetic Degradation

  • Why doing this?)
    • To validate the hypothesis that \(D_1\). and \(D_0\). must suffer from the same kind of degradations
  • Result)
    • As long as the degradations are compatible, autoguidance largely undoes the damage caused by the corruptions.
  • Methods)
    • Base Model)
      • EDM2-S trained on ImageNet 512 without dropout
      • FID : 2.56
    • Dropout
      • Degradation)
        • \(D_1\). : 5% dropout
        • \(D_0\). : 10% dropout
      • Result)
        • FID : 2.25, with \(w=2.25\).
    • Input Noise
      • Degradation)
        • \(D_1\). : Increase noise level \(\sigma\). by 10%
        • \(D_0\). : Increase noise level \(\sigma\). by 20%
      • Result)
        • FID : 2.56, with \(w=2.00\).
    • Mismatched Degradation)
      • Applying dropout to one model, while increasing noise on the other.
      • This did not improve the results.


Experiment)

  • Settings)
    • Data)
      • ImageNet at two resolutions \(512^2, 64^2\).
    • Models)
      • EDM2
        • Latent Diffusion for \(512^2\). images
        • Worked directly on RGB for \(64^2\). images
    • Degradations)
      • Shorter training time
      • Reduced capacity
  • Result)
    • Best result when both degradations were enabled
      • XS-sized guiding model with 1/16th of the training iterations!


