Guiding a Diffusion Model with a Bad Version of Itself (Autoguidance)
hozy Summary
- Classifier Free Guidance
- cf.) Refer to the previous note on CFG for more details.
- Strength)
- Improves the image quality
- Score-based models tend to produce outlier outputs and CFG eliminates those outliers.
- cf.) details
- Score-based models tend to produce outlier outputs and CFG eliminates those outliers.
- Improves the image quality
- Drawbacks)
- Limits its usage as a general low-temperature sampling model.
- Applicable only for conditional generation
- Sampling trajectory can overshoot the desired conditional distribution.
- Result)
- Skewed and often overly simplified image compositions.
- Result)
- The prompt alignment and quality improvement effects cannot be controlled separately.
- Unclear how exactly they are related to each other.
- Autoguidance
- Key Idea)
- Use the same frame work as the CFG.
- Generalize the \(D_0\). into the weaker version of \(D_1\)..
- cf.)
- In CFG, \(D_0\). was the unconditional version of \(D_1\)., which had penalty of marginalizing all class conditions.
- Here, \(D_0\). is generalized to any weaker version of \(D_1\). that has some penalty in training, so that it will underfit the data distribution.
- cf.)
- The sampling distribution will be guided in two directions of…
- maximizing the log likelihood of the original model \(\nabla_\mathbf{x}\log p_1(\mathbf{x}\mid\mathbf{c};\sigma)\).
- avoiding the discrepancy between the original and the weaker models’ distribution \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\).
- Ideation)
- Recall that the score matching objective tended to over-emphasize low probability regions of the data distribution.
- This may due to various factors such as network architecture.
- The problem is that we cannot expect to identify and characterize the specific issues a priori.
- If we set up an additional weaker version of the same model, it may suffer more on similar errors in the same region.
- \(D_1\). : the high-quality model
- \(D_0\). : the poor-quality model
- trained on the same task, conditioning, and data distribution as \(D_1\).
- suffer from certain additional degradations
- e.g.) low-capacity, under-training
- By measuring the difference between the error made by the original model and the weaker one, we may identify where the error is made.
- Key Idea)
2. Background
Concept) Denoising Diffusion
- Problem Setting)
- \(p_{\text{data}}(\mathbf{x})\). : the data distribution
- \(p(\mathbf{x};\sigma) = p_{\text{data}}(\mathbf{x}) * \mathcal{N}(\mathbf{x};\; \mathbf{0}, \sigma^2\mathbf{I})\). : a sequence of increasingly smoothed densities
- where
- \(\sigma\in[0, \sigma_{\max}]\). : the continuous noise level
- Scheduled with \(\sigma(t)=t\). in this paper.
- \(p(\mathbf{x};\sigma_{\max}) \approx \mathcal{N}(\mathbf{x};\; \mathbf{0}, \sigma_{\max}^2\mathbf{I})\). : pure noise!
- \(\sigma\in[0, \sigma_{\max}]\). : the continuous noise level
- where
- \(\text{d}\mathbf{x}_{\sigma} = -\sigma \nabla_{\mathbf{x}_{\sigma}} \log p(\mathbf{x}_{\sigma}; \sigma) \text{d}\sigma\). : the probability flow ODE
- where
- \(\mathbf{x}_{\sigma}\sim p (\mathbf{x}_{\sigma};\sigma),\quad\forall\sigma\in[0, \sigma_{\max}]\).
- \(\mathbf{x}_0 \sim p (\mathbf{x}_0; 0) = p_{\text{data}}(\mathbf{x}_0)\). : the denoised sample!
- Meaning)
- As the noise level decreases (\(\text{d}(-\sigma)\).), the sample \(\mathbf{x}_\sigma\). moves toward the direction that increases the score.
- where
- Sol.)
- The score function \(\nabla_{\mathbf{x}}\log p(\mathbf{x};\sigma)\). for the given sample \(\mathbf{x}\). and the noise level \(\sigma\). to get the trajectory of \(\boldsymbol{\sigma}\). is intractable.
- Instead, we may set up a neural network \(D_\theta(\mathbf{x};\sigma)\)., and train for the denoising task of
- \(\theta = \displaystyle\arg\min_{\theta} \mathbb{E}_{\mathbf{y}\sim p_{\text{data}}, \sigma\sim p_{\text{train}}, \mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})} \big\Vert D_\theta(\mathbf{y+n};\;\sigma) - \mathbf{y} \big\Vert_2^2\).
- where
- \(\mathbf{y}\sim p_{\text{data}}\). is the clean data
- cf.) \(\mathbf{x} = \mathbf{y + n},\quad\mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})\).
- \(p_{\text{train}}\). controls the noise level distribution during training
- e.g.) Uniform
- \(\mathbf{y}\sim p_{\text{data}}\). is the clean data
- Prop.)
- \(\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma) \approx \displaystyle\frac{D_\theta(\mathbf{x};\;\sigma)-\mathbf{x}}{\sigma^2}\).
- where
- \(\theta = \displaystyle\arg\min_{\theta} \mathbb{E}_{\mathbf{y}\sim p_{\text{data}}, \sigma\sim p_{\text{train}}, \mathbf{n}\sim\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})} \big\Vert D_\theta(\mathbf{y+n};\;\sigma) - \mathbf{y} \big\Vert_2^2\).
- Conditional Case)
- Let \(\mathbf{c}\). be the label.
- Given the label \(\mathbf{c}\)., we seek a sample from the conditional distribution \(p(\mathbf{x}\mid\mathbf{c};\;\sigma)\).
- The denoiser network can be denoted as \(D_\theta(\mathbf{x};\sigma,\mathbf{c})\).
Concept) Classifier-Free Guidance (CFG)
- Purpose)
- Push the samples toward higher likelihood of the class label.
- sacrificing the variety of generation
- focusing on more canonical images that the network appears to be better capable of handling
- To generate low temperature samples.
- Why needed?)
- The training objective of a DM aims to cover the entire data distribution.
- Thus, for the low-probability regions, the model gets heavily penalized for not representing them.
- However, the model does not have enough data to learn to generate good images corresponding to them.
- As a result, low quality images are generated.
- Why needed?)
- Push the samples toward higher likelihood of the class label.
- Methodology)
- Train a denoiser network to operate in both conditional and unconditional setting.
- \(D_0(\mathbf{x};\sigma,\mathbf{c})\). :
- \(D_1(\mathbf{x};\sigma,\mathbf{c})\).
- The unconditional generation task specifies a result to avoid.
- Train a denoiser network to operate in both conditional and unconditional setting.
- Model)
- General settings)
- \(D_w(\mathbf{x};\sigma,\mathbf{c}) = w D_1(\mathbf{x};\sigma,\mathbf{c}) + (1-w)D_0(\mathbf{x};\sigma,\mathbf{c})\).
- Choosing \(w\gt1\)., we may over-emphasize the output of \(D_1\)..
- Recall that the denoiser and the score function were equivalent as
- \(\nabla_{\mathbf{x}} \log p(\mathbf{x}; \sigma) \approx \displaystyle\frac{D_\theta(\mathbf{x};\;\sigma)-\mathbf{x}}{\sigma^2}\).
- Adding conditional condition \(\mathbf{c}\). to the probability, we may rewrite as
\(\begin{aligned} D_w(\mathbf{x}\mid\mathbf{c};\;\sigma) &\approx \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log p_w(\mathbf{x}\mid\mathbf{c}; \sigma) & \cdots(A) \\ &\varpropto \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log \Big( p_1(\mathbf{x}\mid\mathbf{c};\sigma)^w \cdot p_0(\mathbf{x}\mid\mathbf{c};\sigma)^{1-w} \Big) \\ &= \mathbf{x} + \sigma^2 \nabla_{\mathbf{x}} \log \left( p_1(\mathbf{x}\mid\mathbf{c};\sigma) \cdot \left[\frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\right]^{w-1} \right) & \cdots(B) \end{aligned}\). - From (A) and (B), we may get
- \(\displaystyle\nabla_{\mathbf{x}} \log p_w(\mathbf{x}\mid\mathbf{c}; \sigma) = \nabla_{\mathbf{x}} \log p_1(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\).
- Then considering the probability flow ODE, we have
\(\begin{aligned} \text{d}\mathbf{x}_{\sigma} &= -\sigma \nabla_{\mathbf{x}_{\sigma}} \log p(\mathbf{x}_{\sigma}; \sigma) \text{d}\sigma \\ &= -\sigma \left( \underbrace{\nabla_{\mathbf{x}} \log p_1(\mathbf{x}\mid\mathbf{c};\sigma)}_{\text{generation by }p_1} + (w-1) \underbrace{\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}}_{\text{perturbation}} \right) \text{d}\sigma \\ \end{aligned}\).- Interpretation)
- The image generation consists of
- the generation by the density \(p_1\).
- the perturbation
- i.e.) the increase in the likelihood that a hypothetical classifier would attribute for the sample having come from density \(p_1\). rather than \(p_0\).
- i.e.) The direction that makes \(p_1\). and \(p_0\). more distinctive
- Larger \(w\)., stronger effect
- The image generation consists of
- Interpretation)
- \(D_w(\mathbf{x};\sigma,\mathbf{c}) = w D_1(\mathbf{x};\sigma,\mathbf{c}) + (1-w)D_0(\mathbf{x};\sigma,\mathbf{c})\).
- Classifier Free Guidance)
- Let \(D_0\). be unconditional : \(\mathbf{c}=\varnothing\).
- i.e.) \(D_0(\mathbf{x};\sigma,\varnothing)\). with \(p_0(\mathbf{x}\mid\sigma)\).
- Put \(D_1 = D_\theta\)..
- Then the above dynamics goes
\(\begin{aligned} \nabla_{\mathbf{x}} \log p_w(\mathbf{x}; \sigma) &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p_\theta(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x};\sigma)} \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log \frac{p(\mathbf{c}\mid\mathbf{x};\sigma)}{p(\mathbf{c};\sigma)} \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c}\mid\mathbf{x};\sigma) - (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c};\sigma) \\ &= \nabla_{\mathbf{x}} \log p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) + (w-1) \nabla_{\mathbf{x}} \log p(\mathbf{c}\mid\mathbf{x};\sigma) & (\because \nabla_{\mathbf{x}}\log p(\mathbf{c};\sigma)=0) \\ \end{aligned}\). - From the above, we may get
- \(p_w(\mathbf{x}; \sigma) \varpropto p_\theta(\mathbf{x}\mid\mathbf{c};\sigma) \cdot \underbrace{p(\mathbf{c}\mid\mathbf{x};\sigma)^{w-1}}_{\text{implied density}}\).
- This is different from the valid heat diffusion that add noise as
- \(p(\mathbf{x}\mid\mathbf{c}; \sigma) = p_{\text{data}}(\mathbf{x}\mid\mathbf{c}) * \mathcal{N}(\mathbf{0},\sigma^2\mathbf{I})\).
- This paper argues that this leads to
- distorted sampling trajectories
- exaggerated truncation
- mode dropping in results
- over saturation of colors
- Let \(D_0\). be unconditional : \(\mathbf{c}=\varnothing\).
- General settings)
- Strength)
- Improves the image quality
- Drawbacks)
- Limits its usage as a general low-temperature sampling model.
- Applicable only for conditional generation
- Sampling trajectory can overshoot the desired conditional distribution.
- Result)
- Skewed and often overly simplified image compositions.
- Result)
- The prompt alignment and quality improvement effects cannot be controlled separately.
- Unclear how exactly they are related to each other.
3. Why does CFG improve image quality?
- Summary)
- Score-based models tend to produce outlier outputs.
- Why?)
- The score matching objective is closely related to the ML estimation.
- The ML estimation that utilizes the KL-Divergence leads to a conservative fit of the data distribution.
- i.e.) Attempting to cover all training samples by giving extreme penalties to the model if it severely underestimates the likelihood of any training example.
- The conservative fit to data distribution results in generating strange and unlikely images from the distribution’s extreme.
- cf.) They are not learnt accurately but included just to avoid the high loss penalty.
- Why?)
- CFG eliminates the outliers.
- How?)
- The unconditional denoiser \(D_0\). underfits the data and results in alleviating the conservative fit of the score-based model.
- Why?)
- Recall that the CFG had two denoiser networks
- \(D_0(\mathbf{x};\sigma)\). : the unconditional denoiser
- \(D_1(\mathbf{x};\sigma,\mathbf{c})\). : the conditional denoiser
- \(D_0\). faces a more difficult task on that it has to generate from all classes at once.
- Whereas \(D_1\). can focus on a single class \(\mathbf{c}\)..
- Given a small slice of training budget, $$D_0$ attains a worse fit to the data.
- Recall that the CFG had two denoiser networks
- Why?)
- The perturbation term \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c})}\). concentrates the samples to be drawn at the “good side” (where the training data is rich) of the data manifold.
- Why?)
- Unconditionally learned distribution \(p_0\). has more spread out distribution compared to the conditional one \(p_1\)..
- Thus, its ratio \(\displaystyle\frac{p_1}{p_0}\). decrease sharply with distance from the manifold.
- Thus, the gradient \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x};\sigma)}\). point inward towards the data manifold.
- This leads the model to sample more within the data manifold.
- Why?)
- The unconditional denoiser \(D_0\). underfits the data and results in alleviating the conservative fit of the score-based model.
- How?)
- Score-based models tend to produce outlier outputs.
Model) 2D Toy Example
- Desc.)
- Dataset is designed to exhibit…
- Low Local Dimensionality for the zero noise case (\(\sigma = 0\).)
- highly anisotropic
- The data distribution is not uniform in all directions
- It is stretched out along certain meaningful axes
- Extremely thin in other, nonsensical directions
- narrow support
- “Support” refers to the region where data actually exists.
- Even if you move along a meaningful “anisotropic” direction, straying even slightly from the path will result in an unrealistic image.
- Why?) No data exists in that region!
- highly anisotropic
- Hierarchical emergence of local detail upon noise removal
- As noise decreases (\(\sigma\rightarrow0\).)…
- Isotropic distribution becomes anisotropic.
- Support gets narrower.
- As noise decreases (\(\sigma\rightarrow0\).)…
- Low Local Dimensionality for the zero noise case (\(\sigma = 0\).)
- Dataset is designed to exhibit…
- Result) Sample Distribution
- (a) : the underlying distribution
- (b) : CFG with no guidance.
- Unlikely outlier samples are generated
- (c) : CFG
- Overshoots the correction and produces a narrow distribution than the ground truth.
- However, this overshooting does not appear to have an adverse effect on images.
- (d) : Naive Truncation
- e.g.)
- GAN’s truncation techniques
- Lowering temperature in generative language models
- Desc.)
- Concentrated in high-probability regions
- But, isotropic with low diversity, oversimplified details, and monotone texture
- e.g.)
- (e) : Autoguidance
- Result) Gradient of each probability distributions
- (a) : Distribution learned by the conditional denoiser \(D_1(\mathbf{x}\mid\mathbf{c};\sigma)\). with intermediate noise level \(\sigma_{\text{mid}}\).
- Learned density approximates the underlying ground truth quite well
- But, it fails to replicate its sharper details.
- (b) : Distribution learned by the unconditional Denoiser \(D_0(\mathbf{x};\sigma)\). with intermediate noise level \(\sigma_{\text{mid}}\).
- Learned a further spread-out density compared to (a)
- Looser fit to the data
- (c) : The perturbation term
- (d) : DM with no guidance
- (e) : CFG with \(w=4\).
- (a) : Distribution learned by the conditional denoiser \(D_1(\mathbf{x}\mid\mathbf{c};\sigma)\). with intermediate noise level \(\sigma_{\text{mid}}\).
4. Our Method (Autoguidance)
- Key Idea)
- Use the same frame work as the CFG.
- Generalize the \(D_0\). into the weaker version of \(D_1\)..
- cf.)
- In CFG, \(D_0\). was the unconditional version of \(D_1\)., which had penalty of marginalizing all class conditions.
- Here, \(D_0\). is generalized to any weaker version of \(D_1\). that has some penalty in training, so that it will underfit the data distribution.
- cf.)
- The sampling distribution will be guided in two directions of…
- maximizing the log likelihood of the original model \(\nabla_\mathbf{x}\log p_1(\mathbf{x}\mid\mathbf{c};\sigma)\).
- avoiding the discrepancy between the original and the weaker models’ distribution \(\displaystyle\nabla_{\mathbf{x}} \log \frac{p_1(\mathbf{x}\mid\mathbf{c};\sigma)}{p_0(\mathbf{x}\mid\mathbf{c};\sigma)}\).
- Ideation)
- Recall that the score matching objective tended to over-emphasize low probability regions of the data distribution.
- This may due to various factors such as network architecture.
- The problem is that we cannot expect to identify and characterize the specific issues a priori.
- If we set up an additional weaker version of the same model, it may suffer more on similar errors in the same region.
- \(D_1\). : the high-quality model
- \(D_0\). : the poor-quality model
- trained on the same task, conditioning, and data distribution as \(D_1\).
- suffer from certain additional degradations
- e.g.) low-capacity, under-training
- By measuring the difference between the error made by the original model and the weaker one, we may identify where the error is made.
Concept) Synthetic Degradation
- Why doing this?)
- To validate the hypothesis that \(D_1\). and \(D_0\). must suffer from the same kind of degradations
- Result)
- As long as the degradations are compatible, autoguidance largely undoes the damage caused by the corruptions.
- Methods)
- Base Model)
- EDM2-S trained on ImageNet 512 without dropout
- FID : 2.56
- Dropout
- Degradation)
- \(D_1\). : 5% dropout
- \(D_0\). : 10% dropout
- Result)
- FID : 2.25, with \(w=2.25\).
- Degradation)
- Input Noise
- Degradation)
- \(D_1\). : Increase noise level \(\sigma\). by 10%
- \(D_0\). : Increase noise level \(\sigma\). by 20%
- Result)
- FID : 2.56, with \(w=2.00\).
- Degradation)
- Mismatched Degradation)
- Applying dropout to one model, while increasing noise on the other.
- This did not improved the results.
- Base Model)
Experiment)
- Settings)
- Data)
- ImageNet at two resolutions \(512^2, 64^2\).
- Models)
- EDM2
- Latent Diffusion for \(512^2\). images
- Worked directly on RGB for \(64^2\). images
- EDM2
- Degradations)
- Shorter training time
- Reduced capacity
- Data)
- Result)
- Best result when both degradations were enabled
- XS-sized guiding model with 1/16th of the training iterations!
- Best result when both degradations were enabled
Enjoy Reading This Article?
Here are some more articles you might like to read next: