Classifier-Free Diffusion Guidance (CFG)
hozy Summary
- Classifier Guidance (Dhariwal & Nichol 2021)
- Goal)
- A technique to boost the sample quality of a diffusion model using an extra trained classifier.
- Enabled generating low temperature samples from a diffusion model
- Model)
\(\begin{aligned} \tilde{\epsilon}_\theta(\mathbf{z}_\lambda, \mathbf{c}) &= \epsilon_\theta(\mathbf{z}_\lambda, \mathbf{c}) - w\sigma_\lambda \nabla_{\mathbf{z}_\lambda} \log p_\theta(\mathbf{c}\mid\mathbf{z}_{\lambda}) \\ &\approx - \sigma_\lambda \nabla_{\mathbf{z}_\lambda} \left[ \log p(\mathbf{z}_{\lambda}\mid\mathbf{c}) + w\log p_\theta(\mathbf{c}\mid\mathbf{z}_{\lambda}) \right] \end{aligned}\). - Drawback)
- Complicates the diffusion model training pipeline.
- Why?)
- Requires training an extra classifier in DM
- Classifier must be trained on noisy data so it is generally not possible to plug in a pre-trained classifier.
- Why?)
- Mixing \(\nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{c}\mid \mathbf{z}_{\lambda})\). can be interpreted as a gradient-based adversarial attack!
- cf.) Just like in GAN models where the generator’s adversarial attack improves the quality of the generation.
- Complicates the diffusion model training pipeline.
- Goal)
- Classifier Free Guidance (Curr Paper)
- Idea)
- Have the same effect as the classifier guidance, but without using the gradients.
- cf.) \(\nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{c}\mid \mathbf{z}_{\lambda})\).
- Simultaneously train the conditional and the unconditional models in a single neural network!
- How?)
- Randomly set \(\mathbf{c} = \varnothing\). with some probability \(p_{\text{uncond}}\)..
- If \(\mathbf{c} = \varnothing\)., we train the unconditional model
- Else, we train the conditional model
- Randomly set \(\mathbf{c} = \varnothing\). with some probability \(p_{\text{uncond}}\)..
- How?)
- Have the same effect as the classifier guidance, but without using the gradients.
- Model)
- Let
- \(p_\theta(\mathbf{z})\). : an unconditional denoising diffusion model
- s.t. parameterized through the score estimator \(\epsilon_\theta(\mathbf{z}_\lambda)\).
- \(p_\theta(\mathbf{z}\vert\mathbf{c})\). : a conditional denoising diffusion model
- s.t. parameterized through the score estimator \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c})\).
- \(p_\theta(\mathbf{z})\). : an unconditional denoising diffusion model
- Then, we may set up the model as
- \(\tilde{\epsilon}_\theta(\mathbf{z}_\lambda,\mathbf{c}) = (1+w)\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c}) - w \epsilon_\theta(\mathbf{z}_\lambda)\).
- Let
- Idea)
2. Background
Concept) Continuous Time Diffusion Model
Problem Setting)
- \(\mathbf{x}\sim p(\mathbf{x})\).
- \(\displaystyle\lambda \triangleq \log\frac{\alpha_{\lambda}^2}{\sigma_{\lambda}^2}\). : the log SNR (signal-noise-ratio)
- where
- \(\alpha_\lambda^2\). : the variance contribution from the signal
- \(\sigma_\lambda^2\). : the variance contribution from the noise
- cf.) Consider a forward process sample
- \(\mathbf{z}_\lambda = \underbrace{\alpha_\lambda \mathbf{x}}_{\text{signal}} + \underbrace{\sigma_\lambda \epsilon}_{\text{noise}} ,\quad \epsilon\sim\mathcal{N}(0,\mathbf{I})\).
- cf.) Consider a forward process sample
- \(\alpha_\lambda^2+\sigma_\lambda^2=1\).
- Desc.)
- Here, the log SNR \(\lambda\). substitutes the time step \(t\). in the discrete settings.
- e.g.) \(\underbrace{t=\{1,2,\cdots,T\}}_{\text{discrete time}}\rightarrow \underbrace{\lambda\in[\lambda_{\min}, \lambda_{\max}]}_{\text{continuous time}}\).
- Why?) The log SNR corresponds with the time step.
- \(\lambda\rightarrow\infty \Leftrightarrow \alpha_\lambda^2 \gg \sigma_\lambda^2\). : Low noise level
- \(t\rightarrow 0\). in the discrete setting
- \(\lambda\rightarrow-\infty \Leftrightarrow \alpha_\lambda^2 \ll \sigma_\lambda^2\). : Pure noise
- \(t\rightarrow T\). in the discrete setting
- \(\lambda\rightarrow\infty \Leftrightarrow \alpha_\lambda^2 \gg \sigma_\lambda^2\). : Low noise level
- Here, the log SNR \(\lambda\). substitutes the time step \(t\). in the discrete settings.
- where
- \(\mathbf{z} = \{ \mathbf{z}_\lambda\mid\lambda\in[\lambda_{\min}, \lambda_{\max}] \},\quad \lambda_{\min}\lt \lambda_{\max}\in\mathbb{R}\).
- \(\mathbf{z}_\lambda\). is the forward process sample with the noise level \(\lambda\).
- \(\lambda\). is clipped to the range \([\lambda_{\min}, \lambda_{\max}]\). for the stability.
- Theoretically, \(\lambda\in\mathbb{R}\).
Forward Process)
- Def.)
- \(q(\mathbf{z}\mid\mathbf{x})\). : the forward process
- where
- \(q(\mathbf{z}_\lambda\mid\mathbf{x}) = \mathcal{N}(\alpha_\lambda\mathbf{x}, \sigma_\lambda^2 \mathbf{I}),\quad\text{s.t. } \alpha_\lambda^2+\sigma_\lambda^2=1\).
- \(\alpha_\lambda^2 = \displaystyle\frac{1}{1+e^{-\lambda}}\). : the signal variance
- cf.) This corresponds to \(\bar{\alpha}_t = \displaystyle\prod_{s=1}^t \alpha_s\).
- Derivation)
- By definition
- \(\lambda \triangleq \log\displaystyle\frac{\alpha_\lambda^2}{\sigma_\lambda^2} = \log\displaystyle\frac{\alpha_\lambda^2}{1-\alpha_\lambda^2}\).
- Thus,
- \(e^{-\lambda} = \displaystyle\frac{1-\alpha_\lambda^2}{\alpha_\lambda^2} \Rightarrow \alpha_\lambda^2 = \frac{1}{1+e^{-\lambda}}\).
- By definition
- \(\sigma_\lambda^2 = 1-\alpha_\lambda^2\).
- where
- \(q(\mathbf{z}\mid\mathbf{x})\). : the forward process
- Prop.)
- \(q(\mathbf{z}_\lambda\mid\mathbf{z}_{\lambda'}) = \mathcal{N}\left(\frac{\alpha_\lambda}{\alpha_{\lambda'}}\mathbf{z}_{\lambda'}, {\sigma_{\lambda\mid\lambda'}}^2 \mathbf{I}\right)\).
- where
- \(\lambda\lt\lambda'\).
- \({\sigma_{\lambda\mid\lambda'}}^2 = \sigma_\lambda^2 - \displaystyle\frac{\alpha_\lambda^2}{\alpha_{\lambda'}^2} \sigma_{\lambda'}^2 = \left( 1-e^{\lambda-\lambda'} \right)\sigma_\lambda^2\).
- where
- \(q(\mathbf{z}_\lambda\mid\mathbf{z}_{\lambda'}) = \mathcal{N}\left(\frac{\alpha_\lambda}{\alpha_{\lambda'}}\mathbf{z}_{\lambda'}, {\sigma_{\lambda\mid\lambda'}}^2 \mathbf{I}\right)\).
Reverse Process)
- Def.)
- \(p(\mathbf{z}) = \displaystyle\int q(\mathbf{z}\mid\mathbf{x}) p(\mathbf{x})\text{d}\mathbf{x}\). : the marginal of \(\mathbf{z}\).
- where
- \(\mathbf{x}\sim p(\mathbf{x})\).
- \(\mathbf{z}\sim q(\mathbf{z}\mid\mathbf{x})\).
- Prop)
- What we want to know.
- But intractable!
- where
- \(p_\theta(\mathbf{z}_{\lambda'}\mid\mathbf{z}_{\lambda}) = \mathcal{N}\left( \tilde{\mu}_{\lambda'\mid\lambda}(\mathbf{z}_{\lambda}, \underbrace{\mathbf{x}_\theta}_{\text{model!}}(\mathbf{z}_{\lambda})), \quad \underbrace{\left({\tilde{\sigma}_{\lambda'\mid\lambda}}^2 \right)^{1-v} \left({\sigma_{\lambda\mid\lambda'}}^2 \right)^{v}}_{\text{from Improved DDPM}} \right)\). : the reverse process
- Derivation)
- Set \(p_\theta(\mathbf{z}_{\lambda_{\min}}) = \mathcal{N}(\mathbf{0,I})\). to be the pure noise.
- Get the posterior of the forward process as
- \(q(\mathbf{z}_{\lambda'}\mid\mathbf{z}_{\lambda}, \mathbf{x}) = \mathcal{N}\left( \tilde{\mu}_{\lambda'\mid\lambda}(\mathbf{z}_{\lambda}, \mathbf{x}), \quad \left({\tilde{\sigma}_{\lambda'\mid\lambda}}^2 \right) \mathbf{I} \right)\).
- where
- \(\tilde{\mu}_{\lambda'\mid\lambda}(\mathbf{z}_{\lambda}, \mathbf{x}) = e^{\lambda-\lambda'}\left(\frac{\alpha_{\lambda'}}{\alpha_{\lambda}}\right)\mathbf{z}_{\lambda} + \left(1- e^{\lambda-\lambda'}\right) \alpha_{\lambda'} \mathbf{x}\).
- \({\tilde{\sigma}_{\lambda'\mid\lambda}}^2 = \left(1-e^{\lambda-\lambda'} \right) {\sigma_{\lambda'}}^2\).
- where
- \(q(\mathbf{z}_{\lambda'}\mid\mathbf{z}_{\lambda}, \mathbf{x}) = \mathcal{N}\left( \tilde{\mu}_{\lambda'\mid\lambda}(\mathbf{z}_{\lambda}, \mathbf{x}), \quad \left({\tilde{\sigma}_{\lambda'\mid\lambda}}^2 \right) \mathbf{I} \right)\).
- Use a model \(\mathbf{x}_\theta\). s.t. \(\mathbf{x}_\theta \approx \mathbf{x}\). so that \(\underbrace{p_\theta(\mathbf{z}_{\lambda'}\mid\mathbf{z}_{\lambda})}_{\text{our model}} \approx \underbrace{q(\mathbf{z}_{\lambda'}\mid\mathbf{z}_{\lambda}, \mathbf{x})}_{\text{posterior of }q}\).
- How?) \(\epsilon\).-prediction from DDPM’s simplified objective
- \(\mathbf{x}_\theta(\mathbf{z}_{\lambda}) = (\mathbf{z}_{\lambda} - \sigma_\lambda \epsilon_\theta(\mathbf{z}_{\lambda}))/\alpha_\lambda\).
- How?) \(\epsilon\).-prediction from DDPM’s simplified objective
- The variance is a log-space interpolation of \({\tilde{\sigma}_{\lambda'\mid\lambda}}^2\). and \({\sigma_{\lambda\mid\lambda'}}^2\).
- cf.) Refer to the Improved DDPM’s reverse process variance.
- \(v\). is a constant hyperparameter in this paper.
- Training Objective)
- \(\mathbb{E}_{\epsilon,\lambda}\Big[ \big\Vert \epsilon_\theta(\mathbf{z}_{\lambda}) - \epsilon \big\Vert_2^2 \Big]\).
- where
- \(\epsilon\sim\mathcal{N}(\mathbf{0,I})\).
- \(z_{\lambda} = \alpha_\lambda \mathbf{x} + \sigma_\lambda \epsilon\).
- \(\lambda \sim p(\lambda)\). drawn over \([\lambda_{\min}, \lambda_{\max}]\).
- Two Cases)
- \(p(\lambda) = \text{Uniform}(\lambda_{\min}, \lambda_{\max})\).
- (v) Not uniform.
- \(\lambda = -2\log\tan(au+b)\).
- where
- \(u\sim\text{Uniform}(0,1)\).
- \(b = \arctan(e^{-\lambda_{\max}/2})\).
- \(a = \arctan(e^{-\lambda_{\min}/2})-b\).
- where
- Desc.)
- a hyperbolic secant distribution modified to be supported on a bounded interval.
- Inspired by the Improved DDPM’s cosine noise schedule
- \(\lambda = -2\log\tan(au+b)\).
- Two Cases)
- Props.)
- \(\epsilon_\theta(\mathbf{z}_\lambda) \approx -\sigma_\lambda \nabla_{\mathbf{z}_\lambda} \log p(\mathbf{z}_{\lambda})\).
- Desc.)
- (LHS) Guessing the original noise \(\epsilon\). in DDPM’s perspective.
- (RHS) \(\nabla_{\mathbf{z}_\lambda} \log p(\mathbf{z}_{\lambda})\). is the score from the Score-Based Model
- Desc.)
- Conditional Generative Model Case
- For the conditioning information \(\mathbf{c}\)., we may set
- \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c})\).
- For the conditioning information \(\mathbf{c}\)., we may set
- \(\epsilon_\theta(\mathbf{z}_\lambda) \approx -\sigma_\lambda \nabla_{\mathbf{z}_\lambda} \log p(\mathbf{z}_{\lambda})\).
- where
- \(\mathbb{E}_{\epsilon,\lambda}\Big[ \big\Vert \epsilon_\theta(\mathbf{z}_{\lambda}) - \epsilon \big\Vert_2^2 \Big]\).
- Derivation)
- \(p(\mathbf{z}) = \displaystyle\int q(\mathbf{z}\mid\mathbf{x}) p(\mathbf{x})\text{d}\mathbf{x}\). : the marginal of \(\mathbf{z}\).
Concept) Temperature
- Def.) \(T\).
- A scaling parameter that controls the “sharpness” of a probability distribution.
- In the softmax, \(p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)},\).
- where $z_i$ are the logits.
- Properties
- \(T = 1\).: recovers the standard softmax.
- \(T < 1\).: distribution becomes sharper (high-confidence, deterministic; one class dominates).
- i.e.) The low temperature!
- \(T > 1\).: distribution becomes flatter (uncertain, closer to uniform).
- \(\lim_{T\to 0}\).: approaches an argmax (one-hot selection).
- \(\lim_{T\to \infty}\).: approaches a uniform distribution.
- Usage
- In generation
- lower \(T\rightarrow\). higher quality but lower diversity;
- higher \(T\rightarrow\). more diverse but lower quality.
- In knowledge distillation: temperature is used to smooth teacher predictions.
- In generation
Concept) Trade-off between IS and FID
- Inception Score (IS)
- Def.)
- \(\text{IS} = \exp\Big(\mathbb{E}_x\big[D_{KL}(p(y\mid x)\mid p(y))\big]\Big)\).
- Prop.)
- Increases when the generated images are more class-distinct and confident
- Larger IS \(\Leftrightarrow\). More distinctive classes \(\Leftrightarrow\). Better!
- Def.)
- Fréchet Inception Distance (FID)
- Def.)
- Let
- \(p_{\text{data}} = \mathcal{N}(\mu_{\text{data}}, \Sigma_{\text{data}})\).
- \(p_{\text{gen}} = \mathcal{N}(\mu_{\text{gen}}, \Sigma_{\text{gen}})\).
- Then
- \(\text{FID} = \Vert \mu_{\text{data}} - \mu_{\text{gen}} \Vert^2 + \text{Tr} \left( \Sigma_{\text{data}} + \Sigma_{\text{gen}} - 2\sqrt{\Sigma_{\text{data}}\Sigma_{\text{gen}}} \right)\).
- Let
- Prop.)
- Decreases when the distribution of the generated data is similar to the original data’s distribution.
- Smaller FID \(\Leftrightarrow\). More similarity \(\Leftrightarrow\). Better!
- Def.)
- Trade-off between IS and FID
- If the generated data becomes more class distinct, then
- IS increases
- FID worsens (increases) because the diversity is reduced and the generated distribution deviates from the real one.
- If the generated data becomes more class distinct, then
3. Guidance
3.1 Classifier Guidance
Dhariwal & Nichol 2021
- Desc.)
- A technique to boost the sample quality of a diffusion model using an extra trained classifier.
- Enabled generating low temperature samples from a diffusion model
- Concept) Low Temperature
- Desc.)
- A technique that makes a probability distribution sharper
- Prop.)
- Sharper distribution leads to more deterministic sampling.
- Thus, the output tends to be more stable and of higher quality, but the diversity decreases.
- Desc.)
- cf.) Similar to BigGAN and low temperature Glow
- Concept) Low Temperature
- Problem Setting)
- \(\mathbf{c}\). : the classifier
- \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c})\). : the classifier guidance
- where \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c}) \approx -\sigma_\lambda \nabla_{\mathbf{z}_\lambda} \log p(\mathbf{z}_{\lambda}\mid\mathbf{c})\). : the diffusion score
- Goal)
- We want to generate data that can be clearly classified into the chosen category.
- How?)
- Maximize \(\log p(\mathbf{z}\mid\mathbf{c})\)..
- Meaning)
- For a given class \(\mathbf{c}\)., higher \(\log p(\mathbf{z}\mid\mathbf{c})\). indicates a higher probability of generating \(\mathbf{z}\). that is recognized as belonging to \(\mathbf{c}\).
- Meaning)
- Practical Trick)
- Mix a diffusion model’s score estimate with the input gradient of the log probability of a classifier
- \(\nabla_{\mathbf{z}_\lambda} \log \tilde{p}_\theta(\mathbf{z}_\lambda\mid\mathbf{c}) = \nabla_{\mathbf{z}_\lambda} \log p_\theta(\mathbf{z}_\lambda\mid\mathbf{c}) + w \cdot \underbrace{\nabla_{\mathbf{z}_\lambda} \log p_\theta(\mathbf{c}\mid\mathbf{z}_\lambda)}_{\text{extra guidance term!}}\).
- Why?)
- The extra term \(\log p(\mathbf{c\mid z})\). may steer the gradient towards increasing the probability that the classifier identifies \(\mathbf{z}\). as class \(\mathbf{c}\).
- Mix a diffusion model’s score estimate with the input gradient of the log probability of a classifier
- Maximize \(\log p(\mathbf{z}\mid\mathbf{c})\)..
- Model)
\(\begin{aligned} \tilde{\epsilon}_\theta(\mathbf{z}_\lambda, \mathbf{c}) &= \epsilon_\theta(\mathbf{z}_\lambda, \mathbf{c}) - w\sigma_\lambda \nabla_{\mathbf{z}_\lambda} \log p_\theta(\mathbf{c}\mid\mathbf{z}_{\lambda}) \\ &\approx - \sigma_\lambda \nabla_{\mathbf{z}_\lambda} \left[ \log p(\mathbf{z}_{\lambda}\mid\mathbf{c}) + w\log p_\theta(\mathbf{c}\mid\mathbf{z}_{\lambda}) \right] \end{aligned}\).- where
- \(w\). : a parameter that controls the strength of the classifier guidance
- where
- Usage)
- Replace the original \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c})\). with \(\tilde{\epsilon}_\theta(\mathbf{z}_\lambda, \mathbf{c})\). in the objective function.
- Then the sampling distribution goes
- \(\tilde{p}_\theta(\mathbf{z}_\lambda\mid\mathbf{c})\varpropto p_\theta(\mathbf{z}_\lambda\mid\mathbf{c}) \cdot p_\theta(\mathbf{c}\mid\mathbf{z}_\lambda)^w\).
- Desc.)
- By setting \(w\gt0\)., we may up-weight the probability of data for which the classifier \(p_\theta(\mathbf{c\mid z}_\lambda)\). assigns high likelihood to the correct label.
- Desc.)
- \(\tilde{p}_\theta(\mathbf{z}_\lambda\mid\mathbf{c})\varpropto p_\theta(\mathbf{z}_\lambda\mid\mathbf{c}) \cdot p_\theta(\mathbf{c}\mid\mathbf{z}_\lambda)^w\).
- Effect)
- By varying the strength \(w\). of the classifier gradient, they could trade off Inception score and FID score.
- Bigger \(s\). \(\Rightarrow\). More samples like that classifier \(y\).
- Then
- Inception Score increases : more recognizable
- FID decreases : more realistic
- Less diversity
- Then
- Bigger \(s\). \(\Rightarrow\). More samples like that classifier \(y\).
- By varying the strength \(w\). of the classifier gradient, they could trade off Inception score and FID score.
- e.g.)
- Three mixture of Gaussian distributions being more separable.
- Three mixture of Gaussian distributions being more separable.
- Drawback)
- Complicates the diffusion model training pipeline.
- Why?)
- Requires training an extra classifier in DM
- Classifier must be trained on noisy data so it is generally not possible to plug in a pre-trained classifier.
- Why?)
- Mixing \(\nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{c}\mid \mathbf{z}_{\lambda})\). can be interpreted as a gradient-based adversarial attack!
- cf.) Just like in GAN models where the generator’s adversarial attack improves the quality of the generation.
- Complicates the diffusion model training pipeline.
3.2 Classifier-Free Guidance
- Goal)
- Have the same effect as the classifier guidance, but without using the gradients.
- cf.) \(\nabla_{\mathbf{z}_{\lambda}} \log p(\mathbf{c}\mid \mathbf{z}_{\lambda})\).
- Have the same effect as the classifier guidance, but without using the gradients.
- Idea)
- Simultaneously train the conditional and the unconditional models in a single neural network!
- How?)
- Randomly set \(\mathbf{c} = \varnothing\). with some probability \(p_{\text{uncond}}\)..
- If \(\mathbf{c} = \varnothing\)., we train the unconditional model
- Else, we train the conditional model
- Randomly set \(\mathbf{c} = \varnothing\). with some probability \(p_{\text{uncond}}\)..
- Model)
- Let
- \(p_\theta(\mathbf{z})\). : an unconditional denoising diffusion model
- s.t. parameterized through the score estimator \(\epsilon_\theta(\mathbf{z}_\lambda)\).
- \(p_\theta(\mathbf{z}\vert\mathbf{c})\). : a conditional denoising diffusion model
- s.t. parameterized through the score estimator \(\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c})\).
- \(p_\theta(\mathbf{z})\). : an unconditional denoising diffusion model
- Then, we may set up the model as
- \(\tilde{\epsilon}_\theta(\mathbf{z}_\lambda,\mathbf{c}) = (1+w)\epsilon_\theta(\mathbf{z}_\lambda,\mathbf{c}) - w \epsilon_\theta(\mathbf{z}_\lambda)\).
- Let
- Prop.)
- No classifier gradient in the model.
- Thus, taking a step in the \(\tilde{\epsilon}_\theta\). direction cannot be interpreted as a gradient-based adversarial attack on an image classifier.
- Not even implicitly has the classifier gradient in the model.
- \(\tilde{\epsilon}_\theta\). is constructed from score estimates that are non-conservative vector fields.
- Why?) An unconstrained neural network is used!
- cf.) A non-conservative vector fields has no mapped scalar function.
- Thus, no scalar potential function exists such as a classifier log likelihood.
- cf.) Implicit Classifier Gradient \(\epsilon^*\).
- Derivation)
- Start from the Bayes Rule s.t. \(p^i(\mathbf{c\mid z}_\lambda) \varpropto \displaystyle\frac{p(\mathbf{z}_\lambda\mid\mathbf{c})}{p(\mathbf{z}_\lambda)}\)..
- If we have access to exact scores \(\epsilon^*(\mathbf{z}_\lambda,\mathbf{c})\). and \(\epsilon^*(\mathbf{z}_\lambda)\). we may get the gradient of this classifier as
- \(\nabla_{\mathbf{z}_\lambda} \log p^i(\mathbf{c}\mid \mathbf{z}_\lambda) = -\frac{1}{\sigma_\lambda} \left[\epsilon^*(\mathbf{z}_\lambda,\mathbf{c})-\epsilon^*(\mathbf{z}_\lambda)\right]\).
- Derivation)
- \(\tilde{\epsilon}_\theta\). is constructed from score estimates that are non-conservative vector fields.
- No classifier gradient in the model.
Tech) CFG Algorithms
- Training
- Sampling
- Strength)
- Simplicity.
- One line change in the code enabled the trade-off between IS and FID.
- No need for an extra trained classifier like the classifier guidance model.
- Simplicity.
Enjoy Reading This Article?
Here are some more articles you might like to read next: