(DM Reconst.) Ch.3 Score-Based Perspective - From EBMs to NCSN

Diffusion Model Conceptual Reconstruction following The Principles of Diffusion Models

Hozy Summary

EBM $\rightarrow$ Score Matching $\rightarrow$ Score-Based Generative Model $\rightarrow$ Sliced Score Matching / Denoising Score Matching $\rightarrow$ NCSN

EBM
- Def.)
  - a model that assigns low energy (stability) on the probable datapoint.
  - $\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}$.
- Drawback)
  - Intractable $Z_\phi$ and its gradient.
Score Matching
- Def.)
  - Calculate the score function which is the gradient of the log-density
    - $s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D$.
- Why it works?
  - $\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})$.
    - where
      - $\tilde{p}(\mathbf{x})$ : an unnormalized density
        
        e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
      - $p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}$.
- Advantage)
  - Tractability : $Z_\phi$ is now gone!
  - Complete Representation of the Underlying distribution.
- Drawback)
  - Needs 2nd order derivative, which is computationally prohibitive in high dimensions.
- Sampling Method)
  - Langevin SDE
    - Def.)
      - Discrete
        
        $\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n$,
        
        for $\eta\gt0, n=0,1,\ldots, \epsilon_n\sim\mathcal{N}(\mathbf{0, I})$
      - Continuous ($\eta\rightarrow0$)
        
        $\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)$.
    - Advantage)
      - Solving simple Langevin SDE is equivalent to sampling from the estimated distribution.
        
        Why?) Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
    - Limit)
      - Sensitivity to hyperparmeters
      - Poor Mixing Time
Score-Based Generative Model
- Idea)
  - Let neural network to parameterize the score function directly
    - $s_\phi(\mathbf{x}) \approx s(\mathbf{x})$.
- Loss)
  $\begin{aligned} \tilde{\mathcal{L}}_{\text{SM}}(\phi) &:= \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right] \\ &\equiv \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2 \right] \\ \end{aligned}$.
- Sampling)
  - Discrete
    - $\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})$.
  - Continuous
    - $\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)$.
- Drawback)
  - $O(D^2)$ cost for training
    - cf.) $\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)$ from $\tilde{\mathcal{L}}_{\text{SM}}(\phi)$
- Improvements)
  - Sliced Score Matching
    - $\tilde{\mathcal{L}}_{\text{SM}}(\phi) = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]$.
      - i.e.) Replaced the trace of the Jacobian with VJP operation in automatic differentiation
      - Still the reliance on the raw data makes the training fragile
  - Denoising Score Matching
    - $\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]$.
      - i.e.) Perturb the data with noise with fixed scale of $\sigma$ and condition on the data.
      - e.g.) Additive Gaussian Noise
        
        $\mathcal{L}_{\text{DSM}}(\phi, \sigma) = \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right]$.
    - By Tweedie’s Formula, DSM is equivalent to denoising.
    - Limit)
      - Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
      - Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.
Noise Conditional Score Network (NCSN)
- Idea)
  - Inject Gaussian noise at multiple levels into the data distribution
  - Jointly train to estimate score functions across a range of noise scales.
- Loss)
  - $\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)$,
    - where
      - $\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]$.
      - $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
- Sampling)
  - Annealed Langevin dynamics

3.1 Energy Based Models (EBMs)

Def.) Energy Function

Settings)
- $\mathbf{x}\in\mathbb{R}^D$ : a data point
Def.)
- $E_\phi(\mathbf{x})$ : an energy function parameterized by $\phi$ s.t.
  - assigns lower energy to more likely configureations
  - $\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}$.
  - $\displaystyle Z_\phi := \int_{\mathbb{R}^D}\exp(-E_{\phi}(\mathbf{x}))\text{d}\mathbf{x}$ : the partition funciton ensuring normalization.
    - i.e.) $\displaystyle\int_{\mathbb{R}^D} \displaystyle p_\phi(\mathbf{x})\text{d}\mathbf{x} = 1$
Prop.)
- Lower energy corresponds to higher probability.
  - In our setting $\frac{\partial }{\partial E_\phi(\mathbf{x})} \exp{(-E_\phi(\mathbf{x}))} \lt 0$
  - More stable state.
- As the model trains (well),
  - it assigns high(low) probabilities on more(less) probable datapoints as the image below.
  - the probability mass is redistributed across the entire space rather than assigned independently to each region.

Training)
- MLE (log)
  - Intractable
    - $\log Z_\phi$ and its gradient
    - Alternative Sols.)
      - Contrastive divergence
      - Score matching

Concept) Score

Settings)
- $p(\mathbf{x}) : \mathbb{R}^D \rightarrow [0,\infty)$ : a density
Def.)
- The score function is the gradient of the log-density
  - $s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D$.
Prop.)
- The score forms a vector field that points toward regions of higher probability.
Advantage)
- Tractability
  - Why?) The partial function $Z$ cancels out.
    - For
      - $\tilde{p}(\mathbf{x})$ : an unnormalized density
        
        e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
      - $p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}$,
    - $\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})$.
- Complete Representation of the Underlying distribution
  - $\log p(\mathbf{x}) = \log p(\mathbf{x}_0) + \displaystyle\int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t$.
    - where
      - $\mathbf{x}_0$ is a reference point
      - $\log p(\mathbf{x}_0)$ is fixed by normalization
    - cf.)
      $\begin{aligned} \int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t &= \int_0^1 s(r(t))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t & (\text{Put } r(t) = \mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0)) \end{aligned}$ : interpolation

Model) Score Matching

Hyvärinen and Dayan, 2005

Idea)
- Match the data and model scores!
Loss
$\begin{aligned} \mathcal{L}_{\text{SM}}(\phi) &= \frac{1}{2} \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left\Vert \nabla_\mathbf{x}\log p_\phi(\mathbf{x}) - \nabla_\mathbf{x}\log p_{\text{data}}(\mathbf{x}) \right\Vert_2^2 \\ &= \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left[ \text{Tr}\left(\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert \nabla_\mathbf{x} E_\phi(\mathbf{x}) \Vert_2^2 \right] + C \\ \end{aligned}$.
- where
  - $\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})$ is the Hessian of $E_\phi$
  - $C$ is a constant independent of $\phi$.
- Advantage)
  - $Z$ is gone!
  - Avoids sampling from the model during training.
- Drawback)
  - Needs 2nd order derivative, which is computationally prohibitive in high dimensions.

Concept) Langevin Sampling

Concept) Discrete-Time Langevin Dynamics

Def.)
$\begin{aligned} \mathbf{x}_{n+1} &= \mathbf{x}_{n} - \eta\nabla_{\mathbf{x}} E_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & n=0,1,\ldots \\ &= \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & (\because\text{By def. of the score}) \end{aligned}$.
- where
  - $\mathbf{x}_{0}$ is initialized from some distribution.
  - $\eta\gt0$ : the step size
  - $\epsilon_n\sim\mathcal{N}(\mathbf{0,I})$ : Gaussian noise
Prop.)
- $\epsilon_n$ enables exploration beyond local minima by adding stochasticity.
- The score function $\log p_\phi(\mathbf{x}_{n})$ guides the samples toward high-density regions.

Concept) Langevin Stochastic Differential Equation (SDE)

Idea)
- $\eta\rightarrow0$ on Discrete-Time Langevin Dynamics.
Def.)
- $\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)$.
Prop.)
- Discrete-time update rule serves as the Euler-Maruyama discretization of this continuous SDE.
- Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
  - Thus, we may sample by simply solving this SDE.
Limit)
- In high-dimensional spaces,
  - its efficiency is highly sensitive to…
    - step size $\eta$
    - noise scale
    - number of iterations
  - it has the issue of poor mixing time.
    - Desc.)
      - Requires extremely long time to transition between regions of high probability
      - The more the data distribution is mixed with many isolated modes, the longer it takes.
        
        Why?) Recall that it relies on the local stochastic updates to escape the local minima
        
        $\sqrt{2}\text{d}\mathbf{w}(t)$.

3.2 Score-Based Generative Model

Tech.) Parameterizing Score Function with Neural Network

Ideation)
- Recall that the score $s(\mathbf{x})=\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})$ is enough to reconstruct $p_{\text{data}}$.
- Then, why don’t we train our model to directly parameterize the score?
  - i.e.) $s_\phi(\mathbf{x}) \approx s(\mathbf{x})$
- We may set up a naive MSE loss as below.
  - $\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right]$.
    - However, we don’t know $s(\mathbf{x})$.
- Hyvärinen and Dayan (2005) suggested an equivalent objective that depends only on $s_\phi(\mathbf{x})$.

Props.)
- When we solve the $\tilde{\mathcal{L}}_{\text{SM}}(\phi)$ minimization problem…
  - the divergence term $\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)$ favors negative value
  - the norm term $\frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2$ suppresses the score $s_\phi$
    - Thus, the result stationary points act as a attractive sinks.
      - i.e.) The points where the dynamics terminates!
      - p66 for details on stationarity and sink.
Sampling
- Same Langevin dynamics can be used for sampling plugging in the frozen oracle score as…
  - Discrete
    - $\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})$.
      - cf.) $\eta s_{\phi^\times}(\mathbf{x}_{n})$ is updated in the direction that increases the density. (Opposite direction of the energy $E_\phi$ case.)
  - Continuous
    - $\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)$.
Drawback)
- Training is still too costly.
  - Why?)
    - Calculating $\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)$ takes $O(D^2)$ times.

Model) Sliced Score Matching

Song et al. 2020

Idea)
- Simplify the loss by adopting the isotropic random vector and the Hutchinson’s identity $\mathbf{u}\in\mathbb{R}^D$
Model)
- For a isotropic random vector $\mathbf{u}\in\mathbb{R}^D$
- $\tilde{\mathcal{L}}_{\text{SM}} = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]$.
Advantage)
- Cheaper computation
  - No more Jacobian nor Hessian
  - Simply utilize the automatic differentiation
    - Jacobian- and vector-Jacobian-product operations (JVP/VJP)
      - Meaning)
        
        Check the model’s behavior along random directions : the projected score is nudged to align with regions of higher data density, so data points become stationary in expectation.
Limit)
- Its reliance on the raw data distribution makes the training fragile.
  - Why?)
    - for image data lying on low-dimensional manifolds, the score $\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})$ may be undefined or unstable
    - the method only constrains the vector field at observed points, providing weak control in their neighborhoods

Concept) Isotropic Random Vector & Hutchinson’s Identity

Def.)
- a random vector whose statistical properties are uniform in all directions
- i.e.) $\mathbb{E}\left[ \mathbf{u} \right] = \mathbf{0},\quad \mathbb{E}\bigg[\underbrace{ \mathbf{u}\mathbf{u}^\top }_{\text{outer prod}}\bigg] = \mathbf{I}$
Hutchinson’s Identity)
- $\text{Tr}(\mathbf{A}) = \mathbb{E}_{\mathbf{u}}\left[ \mathbf{u^\top A u} \right]$.
- $\mathbb{E}_{\mathbf{u}}\left[(\mathbf{u}^\top \mathbf{v})^2\right] = \Vert \mathbf{v}\Vert_2^2$.

Pf.

$$\begin{aligned} (1) &\; \text{Tr}(\mathbf{A}) = \text{Tr}(\mathbf{AI}) = \text{Tr}(\mathbf{A}\mathbb{E}[\mathbf{uu^\top}]) = \text{Tr}(\mathbb{E}[\mathbf{u^\top A u}]) \\ (2) &\; \mathbb{E}[(\mathbf{u^\top v})^2]=\mathbb{E}[\mathbf{(u^\top v)^\top (u^\top v)}] = \mathbb{E}[\mathbf{(v^\top u)(u^\top v)}] = \mathbb{E}[\mathbf{v^\top v}] = \Vert \mathbf{v} \Vert_2^2 \\ \end{aligned}$$

3.3 Denoising Score Matching

Vincent, 2011

Idea)
- Recall we had the originally parameterized loss of $\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \Vert_2^2 \right]$.
- Here, the score $\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})$ was intractable.
- What if we replace $p_{\text{data}}$ with marginally perturbed distribution $p_{\sigma}(\tilde{\mathbf{x}})$?
  - $p_{\sigma}(\tilde{\mathbf{x}}) = \displaystyle\int p_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x}) \text{d}\mathbf{x}$.
    - i.e.) Injecting noise into the data $\mathbf{x}\sim p_{\text{data}}$ via a known conditional distribution $p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})$ with scale $\sigma$
- Although, $\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})$ is intractable
  - Vincent showed that conditioning on $\mathbf{x}\sim p_{\text{data}}$ yields an equivalent tractable objective : Denoising Score Matching (DSM)
Model)
- $\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]$.
Props.)
- The optimal minimizer $s^*(\tilde{\mathbf{x}};\sigma)$ satisfies
  - $s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})$.
- $\sigma\rightarrow0\Rightarrow\begin{cases} p_\sigma(\tilde{\mathbf{x}}) \rightarrow p_{\text{data}}(\mathbf{x}) \\ s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}) \rightarrow \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \end{cases}$.
- $\mathcal{L}_{\text{DSM}}$ is equivalent to $\mathcal{L}_{\text{SM}}$.
  - i.e.) $\exists C\in\mathbb{R} \text{ s.t. } \mathcal{L}_{\text{SM}}(\phi;\sigma) = \mathcal{L}_{\text{DSM}}(\phi;\sigma) + C$
- Denoising Score Matching is equivalent to denoising.
  - $\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}$.
    - Refer to Tweedie’s Formula below.
Sampling
- Langevin dynamics
  - $\tilde{\mathbf{x}}_{n+1} = \tilde{\mathbf{x}}_{n} + \eta \underbrace{s_{\phi^\times}(\tilde{\mathbf{x}}_{n})}_{\approx\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}_n)} + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})$.
Advantage)
- Well-defined gradients
  - Injected noise provides full support in $\mathbb{R}^D$.
    - cf.) Recall that data point is a low-dimensional manifold.
  - Thus, the score function is well defined everywhere.
- Improved coverage.
  - Injected noise smooths out sparse regions between modes.
  - This enhances the training signal quality and facilitates the sampling.
Limit)
- Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
  - Low Noise Level
    - Langevin dynamics struggles to traverse modes in multi-modal distributions due to vanishing gradients in low-density regions.
  - High Noise Level
    - The model captures only coarse structures, resulting in blurry samples.
- Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.

E.g.) Additive Gaussian Noise DSM

Model)
- Perturbation : $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\epsilon,\quad \epsilon\sim\mathcal{N}(\mathbf{0, I})$
- Then
  - $p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \mathbf{x}, \sigma^2\mathbf{I})$.
  - $\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \displaystyle\frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2}$.
- Thus, we may further simplify the loss as…
  $\begin{aligned} \mathcal{L}_{\text{DSM}}(\phi, \sigma) &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right] \\ &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right] \\ \end{aligned}$.

Concept) Tweedie’s Formula

Goal)
- We want to show that DSM is equivalent to denoising.
Settings)
- Denoising
  - We may define denoising as $\mathbf{x}\mid\tilde{\mathbf{x}}$.
    - i.e.) getting the clean data $\mathbf{x}$ given the noisy data $\tilde{\mathbf{x}}$.
  - The expected value of such values can be a reasonable estimate
    - $\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}\mid\tilde{\mathbf{x}})}[\mathbf{x}\mid\tilde{\mathbf{x}}]$.
Result)
- $\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}$.
Meaning)
- Optimizing the DSM score and Langevin sampling from it is a perfect denoiser.

Derivation.

- Assume that the perturbation kernel (or conditional distribution of $\tilde{\mathbf{x}}$ given $\mathbf{x}$) is Gaussian. $$\displaystyle p(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \alpha\mathbf{x}, \sigma^2\mathbf{I}) \varpropto \exp\left(-\frac{\Vert\tilde{\mathbf{x}} - \alpha\mathbf{x}\Vert^2}{2\sigma^2}\right)$$ --- cf.) Other exponential family distribution works thanks to the deferentiation rule below.
- Then, we may get the gradient of the prior distribution as $$\displaystyle\nabla_{\tilde{\mathbf{x}}}\; p(\tilde{\mathbf{x}}\mid \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\;p(\tilde{\mathbf{x}}\mid\mathbf{x}) \quad\cdots\quad (A)$$ - Now, consider that DSM score can be rewritten as $$\begin{aligned} \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} p_\sigma(\tilde{\mathbf{x}}) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} \left( \int p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \right) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \int \nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= \int \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \underbrace{\nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})}_{(A)} p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; \frac{p(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})}{p_\sigma(\tilde{\mathbf{x}})} \text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} & (\because\text{Bayes' Rule}) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} \int p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} - \alpha \int \mathbf{x} p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} \right) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} - \alpha \mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] \right) \\ \end{aligned}$$ - Above can be rewritten as $$\displaystyle\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] = \frac{1}{\alpha} \left( \tilde{\mathbf{x}} + \sigma^2\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)$$

3.4 Multi-Noise Levels of Denoising Score Matching

Model) Noise Conditional Score Networks (NCSN)

Song et al 2019

Idea)
- Recall that DSM’s single fixed noise scale $\sigma$ was a bottleneck.
- Instead,
  - inject Gaussian noise at multiple levels into the data distribution
  - jointly train a noise-conditional score network (NCSN) to estimate score functions across a range of noise scales.
Training)
- Settings)
  - $0\lt\sigma_{1}\lt\sigma_{2}\lt\cdots\lt\sigma_{L}$ : a sequence of $L$ noise levels
    - $\sigma_{1}$ is small enough to preserve most of the data’s fine details
    - $\sigma_{L}$ is large enough to sufficiently smooth the distribution, fascilitating easier training.
  - Perturbation Kernel : $p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) := \mathcal{N}(\mathbf{x}_\sigma;\; \mathbf{x}, \sigma^2\mathbf{I}$
  - Marginal Distribution : $p_\sigma(\mathbf{x}_\sigma) = \displaystyle\int p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x}$
- Loss
  - $\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)$,
    - where
      - $\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]$.
      - $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
- Result)
  - $s^*(\mathbf{x}_\sigma, \sigma) = \nabla_{\mathbf{x}_\sigma}\log p_\sigma(\mathbf{x}_\sigma)$.
    - i.e.) Optimal score for each noise level $\sigma_i$
  - Prop.)
    - By Tweedie’s Formula, this is equivalent to the Bayes optimal noise predictor under DDPM loss.
      - Refer to x-prediction.
Sampling)
- Annealed Langevin Dynamics
  - cf.) Step Size
    - Typically scaled by the noise level as
      - $\eta_\ell = \delta\cdot\frac{\sigma_\ell^2}{\sigma_1^2}$.

Prop.)
- Slow Sampling speed.
  - $O(LK)$ where $L$ is the number of noise scales and $K$ is the iterative step in each scale during the annealed Langevin Dynamics.

Analysis) NCSN vs DDPM