(DM Reconst.) Ch.3 Score-Based Perspective - From EBMs to NCSN

Diffusion Model Conceptual Reconstruction following The Principles of Diffusion Models



Hozy Summary

EBM $\rightarrow$ Score Matching $\rightarrow$ Score-Based Generative Model $\rightarrow$ Sliced Score Matching / Denoising Score Matching $\rightarrow$ NCSN

  • EBM
    • Def.)
      • a model that assigns low energy (stability) on the probable datapoint.
      • \(\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}\).
    • Drawback)
      • Intractable $Z_\phi$ and its gradient.
  • Score Matching
    • Def.)
      • Calculate the score function which is the gradient of the log-density
        • \(s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D\).
    • Why it works?
      • \(\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})\).
        • where
          • $\tilde{p}(\mathbf{x})$ : an unnormalized density
            • e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
          • \(p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}\).
    • Advantage)
      • Tractability : $Z_\phi$ is now gone!
      • Complete Representation of the Underlying distribution.
    • Drawback)
      • Needs 2nd order derivative, which is computationally prohibitive in high dimensions.
    • Sampling Method)
      • Langevin SDE
        • Def.)
          • Discrete
            • \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n\),
              • for \(\eta\gt0, n=0,1,\ldots, \epsilon_n\sim\mathcal{N}(\mathbf{0, I})\)
          • Continuous ($\eta\rightarrow0$)
            • \(\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
        • Advantage)
          • Solving simple Langevin SDE is equivalent to sampling from the estimated distribution.
            • Why?) Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
        • Limit)
          • Sensitivity to hyperparmeters
          • Poor Mixing Time
  • Score-Based Generative Model
    • Idea)
    • Loss)
      \(\begin{aligned} \tilde{\mathcal{L}}_{\text{SM}}(\phi) &:= \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right] \\ &\equiv \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2 \right] \\ \end{aligned}\).
    • Sampling)
      • Discrete
        • \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
      • Continuous
        • \(\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
    • Drawback)
      • $O(D^2)$ cost for training
        • cf.) \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) from \(\tilde{\mathcal{L}}_{\text{SM}}(\phi)\)
    • Improvements)
      • Sliced Score Matching
        • \(\tilde{\mathcal{L}}_{\text{SM}}(\phi) = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]\).
          • i.e.) Replaced the trace of the Jacobian with VJP operation in automatic differentiation
          • Still the reliance on the raw data makes the training fragile
      • Denoising Score Matching
        • \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]\).
          • i.e.) Perturb the data with noise with fixed scale of $\sigma$ and condition on the data.
          • e.g.) Additive Gaussian Noise
            • \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) = \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right]\).
        • By Tweedie’s Formula, DSM is equivalent to denoising.
        • Limit)
          • Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
          • Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.
  • Noise Conditional Score Network (NCSN)
    • Idea)
      • Inject Gaussian noise at multiple levels into the data distribution
      • Jointly train to estimate score functions across a range of noise scales.
    • Loss)
      • \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
        • where
          • \(\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]\).
          • $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
    • Sampling)
      • Annealed Langevin dynamics



3.1 Energy Based Models (EBMs)

Def.) Energy Function

  • Settings)
    • \(\mathbf{x}\in\mathbb{R}^D\) : a data point
  • Def.)
    • \(E_\phi(\mathbf{x})\) : an energy function parameterized by \(\phi\) s.t.
      • assigns lower energy to more likely configureations
      • \(\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}\).
      • \(\displaystyle Z_\phi := \int_{\mathbb{R}^D}\exp(-E_{\phi}(\mathbf{x}))\text{d}\mathbf{x}\) : the partition funciton ensuring normalization.
        • i.e.) \(\displaystyle\int_{\mathbb{R}^D} \displaystyle p_\phi(\mathbf{x})\text{d}\mathbf{x} = 1\)
  • Prop.)
    • Lower energy corresponds to higher probability.
      • In our setting \(\frac{\partial }{\partial E_\phi(\mathbf{x})} \exp{(-E_\phi(\mathbf{x}))} \lt 0\)
      • More stable state.
    • As the model trains (well),
      • it assigns high(low) probabilities on more(less) probable datapoints as the image below.
      • the probability mass is redistributed across the entire space rather than assigned independently to each region.
  • Training)
    • MLE (log)
      • Intractable
        • \(\log Z_\phi\) and its gradient
        • Alternative Sols.)
          • Contrastive divergence
          • Score matching


Concept) Score

  • Settings)
    • \(p(\mathbf{x}) : \mathbb{R}^D \rightarrow [0,\infty)\) : a density
  • Def.)
    • The score function is the gradient of the log-density
      • \(s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D\).
  • Prop.)
    • The score forms a vector field that points toward regions of higher probability.
  • Advantage)
    • Tractability
      • Why?) The partial function $Z$ cancels out.
        • For
          • $\tilde{p}(\mathbf{x})$ : an unnormalized density
            • e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
          • \(p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}\),
        • \(\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})\).
    • Complete Representation of the Underlying distribution
      • \(\log p(\mathbf{x}) = \log p(\mathbf{x}_0) + \displaystyle\int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t\).
        • where
          • $\mathbf{x}_0$ is a reference point
          • $\log p(\mathbf{x}_0)$ is fixed by normalization
        • cf.)
          \(\begin{aligned} \int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t &= \int_0^1 s(r(t))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t & (\text{Put } r(t) = \mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0)) \end{aligned}\) : interpolation


Model) Score Matching

Hyvärinen and Dayan, 2005

  • Idea)
    • Match the data and model scores!
  • Loss
    \(\begin{aligned} \mathcal{L}_{\text{SM}}(\phi) &= \frac{1}{2} \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left\Vert \nabla_\mathbf{x}\log p_\phi(\mathbf{x}) - \nabla_\mathbf{x}\log p_{\text{data}}(\mathbf{x}) \right\Vert_2^2 \\ &= \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left[ \text{Tr}\left(\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert \nabla_\mathbf{x} E_\phi(\mathbf{x}) \Vert_2^2 \right] + C \\ \end{aligned}\).
    • where
      • \(\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})\) is the Hessian of $E_\phi$
      • $C$ is a constant independent of $\phi$.
    • Advantage)
      • $Z$ is gone!
      • Avoids sampling from the model during training.
    • Drawback)
      • Needs 2nd order derivative, which is computationally prohibitive in high dimensions.


Concept) Langevin Sampling

Concept) Discrete-Time Langevin Dynamics

  • Def.)
    \(\begin{aligned} \mathbf{x}_{n+1} &= \mathbf{x}_{n} - \eta\nabla_{\mathbf{x}} E_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & n=0,1,\ldots \\ &= \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & (\because\text{By def. of the score}) \end{aligned}\).
    • where
      • $\mathbf{x}_{0}$ is initialized from some distribution.
      • $\eta\gt0$ : the step size
      • $\epsilon_n\sim\mathcal{N}(\mathbf{0,I})$ : Gaussian noise
  • Prop.)
    • $\epsilon_n$ enables exploration beyond local minima by adding stochasticity.
    • The score function $\log p_\phi(\mathbf{x}_{n})$ guides the samples toward high-density regions.

Concept) Langevin Stochastic Differential Equation (SDE)

  • Idea)
  • Def.)
    • \(\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
  • Prop.)
    • Discrete-time update rule serves as the Euler-Maruyama discretization of this continuous SDE.
    • Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
      • Thus, we may sample by simply solving this SDE.
  • Limit)
    • In high-dimensional spaces,
      • its efficiency is highly sensitive to…
        • step size $\eta$
        • noise scale
        • number of iterations
      • it has the issue of poor mixing time.
        • Desc.)
          • Requires extremely long time to transition between regions of high probability
          • The more the data distribution is mixed with many isolated modes, the longer it takes.
            • Why?) Recall that it relies on the local stochastic updates to escape the local minima
              • \(\sqrt{2}\text{d}\mathbf{w}(t)\).



3.2 Score-Based Generative Model

Tech.) Parameterizing Score Function with Neural Network

  • Ideation)
    • Recall that the score \(s(\mathbf{x})=\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})\) is enough to reconstruct $p_{\text{data}}$.
    • Then, why don’t we train our model to directly parameterize the score?
      • i.e.) \(s_\phi(\mathbf{x}) \approx s(\mathbf{x})\)
    • We may set up a naive MSE loss as below.
      • \(\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right]\).
        • However, we don’t know \(s(\mathbf{x})\).
    • Hyvärinen and Dayan (2005) suggested an equivalent objective that depends only on $s_\phi(\mathbf{x})$.
  • Props.)
    • When we solve the \(\tilde{\mathcal{L}}_{\text{SM}}(\phi)\) minimization problem…
      • the divergence term \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) favors negative value
      • the norm term \(\frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2\) suppresses the score $s_\phi$
        • Thus, the result stationary points act as a attractive sinks.
          • i.e.) The points where the dynamics terminates!
          • p66 for details on stationarity and sink.
  • Sampling
    • Same Langevin dynamics can be used for sampling plugging in the frozen oracle score as…
      • Discrete
        • \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
          • cf.) \(\eta s_{\phi^\times}(\mathbf{x}_{n})\) is updated in the direction that increases the density. (Opposite direction of the energy $E_\phi$ case.)
      • Continuous
        • \(\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
  • Drawback)
    • Training is still too costly.
      • Why?)
        • Calculating \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) takes $O(D^2)$ times.



Model) Sliced Score Matching

Song et al. 2020

  • Idea)
  • Model)
    • For a isotropic random vector $\mathbf{u}\in\mathbb{R}^D$
    • \(\tilde{\mathcal{L}}_{\text{SM}} = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]\).
  • Advantage)
    • Cheaper computation
      • No more Jacobian nor Hessian
      • Simply utilize the automatic differentiation
        • Jacobian- and vector-Jacobian-product operations (JVP/VJP)
          • Meaning)
            • Check the model’s behavior along random directions : the projected score is nudged to align with regions of higher data density, so data points become stationary in expectation.
  • Limit)
    • Its reliance on the raw data distribution makes the training fragile.
      • Why?)
        • for image data lying on low-dimensional manifolds, the score \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\) may be undefined or unstable
        • the method only constrains the vector field at observed points, providing weak control in their neighborhoods


Concept) Isotropic Random Vector & Hutchinson’s Identity

  • Def.)
    • a random vector whose statistical properties are uniform in all directions
    • i.e.) \(\mathbb{E}\left[ \mathbf{u} \right] = \mathbf{0},\quad \mathbb{E}\bigg[\underbrace{ \mathbf{u}\mathbf{u}^\top }_{\text{outer prod}}\bigg] = \mathbf{I}\)
  • Hutchinson’s Identity)
    • \(\text{Tr}(\mathbf{A}) = \mathbb{E}_{\mathbf{u}}\left[ \mathbf{u^\top A u} \right]\).
    • \(\mathbb{E}_{\mathbf{u}}\left[(\mathbf{u}^\top \mathbf{v})^2\right] = \Vert \mathbf{v}\Vert_2^2\).
Pf. $$\begin{aligned} (1) &\; \text{Tr}(\mathbf{A}) = \text{Tr}(\mathbf{AI}) = \text{Tr}(\mathbf{A}\mathbb{E}[\mathbf{uu^\top}]) = \text{Tr}(\mathbb{E}[\mathbf{u^\top A u}]) \\ (2) &\; \mathbb{E}[(\mathbf{u^\top v})^2]=\mathbb{E}[\mathbf{(u^\top v)^\top (u^\top v)}] = \mathbb{E}[\mathbf{(v^\top u)(u^\top v)}] = \mathbb{E}[\mathbf{v^\top v}] = \Vert \mathbf{v} \Vert_2^2 \\ \end{aligned}$$




3.3 Denoising Score Matching

Vincent, 2011

  • Idea)
    • Recall we had the originally parameterized loss of \(\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \Vert_2^2 \right]\).
    • Here, the score \(\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})\) was intractable.
    • What if we replace $p_{\text{data}}$ with marginally perturbed distribution \(p_{\sigma}(\tilde{\mathbf{x}})\)?
      • \(p_{\sigma}(\tilde{\mathbf{x}}) = \displaystyle\int p_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x}) \text{d}\mathbf{x}\).
        • i.e.) Injecting noise into the data $\mathbf{x}\sim p_{\text{data}}$ via a known conditional distribution \(p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})\) with scale $\sigma$
    • Although, \(\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})\) is intractable
      • Vincent showed that conditioning on $\mathbf{x}\sim p_{\text{data}}$ yields an equivalent tractable objective : Denoising Score Matching (DSM)
  • Model)
    • \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]\).
  • Props.)
    • The optimal minimizer $s^*(\tilde{\mathbf{x}};\sigma)$ satisfies
      • \(s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})\).
    • \(\sigma\rightarrow0\Rightarrow\begin{cases} p_\sigma(\tilde{\mathbf{x}}) \rightarrow p_{\text{data}}(\mathbf{x}) \\ s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}) \rightarrow \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \end{cases}\).
    • \(\mathcal{L}_{\text{DSM}}\) is equivalent to \(\mathcal{L}_{\text{SM}}\).
      • i.e.) \(\exists C\in\mathbb{R} \text{ s.t. } \mathcal{L}_{\text{SM}}(\phi;\sigma) = \mathcal{L}_{\text{DSM}}(\phi;\sigma) + C\)
    • Denoising Score Matching is equivalent to denoising.
      • \(\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}\).
  • Sampling
    • Langevin dynamics
      • \(\tilde{\mathbf{x}}_{n+1} = \tilde{\mathbf{x}}_{n} + \eta \underbrace{s_{\phi^\times}(\tilde{\mathbf{x}}_{n})}_{\approx\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}_n)} + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
  • Advantage)
    • Well-defined gradients
      • Injected noise provides full support in $\mathbb{R}^D$.
        • cf.) Recall that data point is a low-dimensional manifold.
      • Thus, the score function is well defined everywhere.
    • Improved coverage.
      • Injected noise smooths out sparse regions between modes.
      • This enhances the training signal quality and facilitates the sampling.
  • Limit)
    • Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
      • Low Noise Level
        • Langevin dynamics struggles to traverse modes in multi-modal distributions due to vanishing gradients in low-density regions.
      • High Noise Level
        • The model captures only coarse structures, resulting in blurry samples.
    • Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.


E.g.) Additive Gaussian Noise DSM

  • Model)
    • Perturbation : $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\epsilon,\quad \epsilon\sim\mathcal{N}(\mathbf{0, I})$
    • Then
      • \(p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \mathbf{x}, \sigma^2\mathbf{I})\).
      • \(\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \displaystyle\frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2}\).
    • Thus, we may further simplify the loss as…
      \(\begin{aligned} \mathcal{L}_{\text{DSM}}(\phi, \sigma) &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right] \\ &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right] \\ \end{aligned}\).


Concept) Tweedie’s Formula

  • Goal)
    • We want to show that DSM is equivalent to denoising.
  • Settings)
    • Denoising
      • We may define denoising as $\mathbf{x}\mid\tilde{\mathbf{x}}$.
        • i.e.) getting the clean data $\mathbf{x}$ given the noisy data $\tilde{\mathbf{x}}$.
      • The expected value of such values can be a reasonable estimate
        • \(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}\mid\tilde{\mathbf{x}})}[\mathbf{x}\mid\tilde{\mathbf{x}}]\).
  • Result)
    • \(\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}\).
  • Meaning)
    • Optimizing the DSM score and Langevin sampling from it is a perfect denoiser.
Derivation. - Assume that the perturbation kernel (or conditional distribution of $\tilde{\mathbf{x}}$ given $\mathbf{x}$) is Gaussian. $$\displaystyle p(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \alpha\mathbf{x}, \sigma^2\mathbf{I}) \varpropto \exp\left(-\frac{\Vert\tilde{\mathbf{x}} - \alpha\mathbf{x}\Vert^2}{2\sigma^2}\right)$$ --- cf.) Other exponential family distribution works thanks to the deferentiation rule below.
- Then, we may get the gradient of the prior distribution as $$\displaystyle\nabla_{\tilde{\mathbf{x}}}\; p(\tilde{\mathbf{x}}\mid \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\;p(\tilde{\mathbf{x}}\mid\mathbf{x}) \quad\cdots\quad (A)$$ - Now, consider that DSM score can be rewritten as $$\begin{aligned} \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} p_\sigma(\tilde{\mathbf{x}}) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} \left( \int p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \right) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \int \nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= \int \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \underbrace{\nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})}_{(A)} p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; \frac{p(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})}{p_\sigma(\tilde{\mathbf{x}})} \text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} & (\because\text{Bayes' Rule}) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} \int p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} - \alpha \int \mathbf{x} p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} \right) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} - \alpha \mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] \right) \\ \end{aligned}$$ - Above can be rewritten as $$\displaystyle\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] = \frac{1}{\alpha} \left( \tilde{\mathbf{x}} + \sigma^2\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)$$



3.4 Multi-Noise Levels of Denoising Score Matching

Model) Noise Conditional Score Networks (NCSN)

Song et al 2019

  • Idea)
    • Recall that DSM’s single fixed noise scale $\sigma$ was a bottleneck.
    • Instead,
      • inject Gaussian noise at multiple levels into the data distribution
      • jointly train a noise-conditional score network (NCSN) to estimate score functions across a range of noise scales.
  • Training)
    • Settings)
      • \(0\lt\sigma_{1}\lt\sigma_{2}\lt\cdots\lt\sigma_{L}\) : a sequence of $L$ noise levels
        • $\sigma_{1}$ is small enough to preserve most of the data’s fine details
        • $\sigma_{L}$ is large enough to sufficiently smooth the distribution, fascilitating easier training.
      • Perturbation Kernel : \(p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) := \mathcal{N}(\mathbf{x}_\sigma;\; \mathbf{x}, \sigma^2\mathbf{I}\)
      • Marginal Distribution : \(p_\sigma(\mathbf{x}_\sigma) = \displaystyle\int p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x}\)
    • Loss
      • \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
        • where
          • \(\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]\).
          • $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
    • Result)
      • \(s^*(\mathbf{x}_\sigma, \sigma) = \nabla_{\mathbf{x}_\sigma}\log p_\sigma(\mathbf{x}_\sigma)\).
        • i.e.) Optimal score for each noise level $\sigma_i$
      • Prop.)
  • Sampling)
    • Annealed Langevin Dynamics
      • cf.) Step Size
        • Typically scaled by the noise level as
          • \(\eta_\ell = \delta\cdot\frac{\sigma_\ell^2}{\sigma_1^2}\).
  • Prop.)
    • Slow Sampling speed.
      • $O(LK)$ where $L$ is the number of noise scales and $K$ is the iterative step in each scale during the annealed Langevin Dynamics.



Analysis) NCSN vs DDPM




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Score-Based Generative Modeling through Stochastic Differential Equation
  • Guiding a Diffusion Model with a Bad Version of Itself (Autoguidance)
  • (DM Reconst.) Ch.2 Variational Perspective - From VAEs to DDPM
  • Variational Autoencoder Bayes (VAE)
  • Denoising Diffusion Probabilistic Models (DDPM)