(DM Reconst.) Ch.3 Score-Based Perspective - From EBMs to NCSN
Diffusion Model Conceptual Reconstruction following The Principles of Diffusion Models
Hozy Summary
EBM $\rightarrow$ Score Matching $\rightarrow$ Score-Based Generative Model $\rightarrow$ Sliced Score Matching / Denoising Score Matching $\rightarrow$ NCSN
- EBM
- Def.)
- a model that assigns low energy (stability) on the probable datapoint.
- \(\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}\).
- Drawback)
- Intractable $Z_\phi$ and its gradient.
- Def.)
- Score Matching
- Def.)
- Calculate the score function which is the gradient of the log-density
- \(s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D\).
- Calculate the score function which is the gradient of the log-density
- Why it works?
- \(\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})\).
- where
- $\tilde{p}(\mathbf{x})$ : an unnormalized density
- e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
- \(p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}\).
- $\tilde{p}(\mathbf{x})$ : an unnormalized density
- where
- \(\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})\).
- Advantage)
- Tractability : $Z_\phi$ is now gone!
- Complete Representation of the Underlying distribution.
- Drawback)
- Needs 2nd order derivative, which is computationally prohibitive in high dimensions.
- Sampling Method)
- Langevin SDE
- Def.)
- Discrete
- \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n\),
- for \(\eta\gt0, n=0,1,\ldots, \epsilon_n\sim\mathcal{N}(\mathbf{0, I})\)
- \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n\),
- Continuous ($\eta\rightarrow0$)
- \(\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
- Discrete
- Advantage)
- Solving simple Langevin SDE is equivalent to sampling from the estimated distribution.
- Why?) Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
- Solving simple Langevin SDE is equivalent to sampling from the estimated distribution.
- Limit)
- Sensitivity to hyperparmeters
- Poor Mixing Time
- Def.)
- Langevin SDE
- Def.)
- Score-Based Generative Model
- Idea)
- Let neural network to parameterize the score function directly
- \(s_\phi(\mathbf{x}) \approx s(\mathbf{x})\).
- Let neural network to parameterize the score function directly
- Loss)
\(\begin{aligned} \tilde{\mathcal{L}}_{\text{SM}}(\phi) &:= \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right] \\ &\equiv \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2 \right] \\ \end{aligned}\). - Sampling)
- Discrete
- \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
- Continuous
- \(\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
- Discrete
- Drawback)
- $O(D^2)$ cost for training
- cf.) \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) from \(\tilde{\mathcal{L}}_{\text{SM}}(\phi)\)
- $O(D^2)$ cost for training
- Improvements)
- Sliced Score Matching
- \(\tilde{\mathcal{L}}_{\text{SM}}(\phi) = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]\).
- i.e.) Replaced the trace of the Jacobian with VJP operation in automatic differentiation
- Still the reliance on the raw data makes the training fragile
- \(\tilde{\mathcal{L}}_{\text{SM}}(\phi) = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]\).
- Denoising Score Matching
- \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]\).
- i.e.) Perturb the data with noise with fixed scale of $\sigma$ and condition on the data.
- e.g.) Additive Gaussian Noise
- \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) = \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right]\).
- By Tweedie’s Formula, DSM is equivalent to denoising.
- Limit)
- Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
- Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.
- \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]\).
- Sliced Score Matching
- Idea)
- Noise Conditional Score Network (NCSN)
- Idea)
- Inject Gaussian noise at multiple levels into the data distribution
- Jointly train to estimate score functions across a range of noise scales.
- Loss)
- \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
- where
- \(\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]\).
- $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
- where
- \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
- Sampling)
- Annealed Langevin dynamics
- Idea)
3.1 Energy Based Models (EBMs)
Def.) Energy Function
- Settings)
- \(\mathbf{x}\in\mathbb{R}^D\) : a data point
- Def.)
- \(E_\phi(\mathbf{x})\) : an energy function parameterized by \(\phi\) s.t.
- assigns lower energy to more likely configureations
- \(\displaystyle p_\phi(\mathbf{x}) := \frac{\exp{(-E_\phi(\mathbf{x}))}}{Z_\phi}\).
- \(\displaystyle Z_\phi := \int_{\mathbb{R}^D}\exp(-E_{\phi}(\mathbf{x}))\text{d}\mathbf{x}\) : the partition funciton ensuring normalization.
- i.e.) \(\displaystyle\int_{\mathbb{R}^D} \displaystyle p_\phi(\mathbf{x})\text{d}\mathbf{x} = 1\)
- \(E_\phi(\mathbf{x})\) : an energy function parameterized by \(\phi\) s.t.
- Prop.)
- Lower energy corresponds to higher probability.
- In our setting \(\frac{\partial }{\partial E_\phi(\mathbf{x})} \exp{(-E_\phi(\mathbf{x}))} \lt 0\)
- More stable state.
- As the model trains (well),
- it assigns high(low) probabilities on more(less) probable datapoints as the image below.
- the probability mass is redistributed across the entire space rather than assigned independently to each region.
- Lower energy corresponds to higher probability.
- Training)
- MLE (log)
- Intractable
- \(\log Z_\phi\) and its gradient
- Alternative Sols.)
- Contrastive divergence
- Score matching
- Intractable
- MLE (log)
Concept) Score
- Settings)
- \(p(\mathbf{x}) : \mathbb{R}^D \rightarrow [0,\infty)\) : a density
- Def.)
- The score function is the gradient of the log-density
- \(s(\mathbf{x}) := \nabla_{\mathbf{x}} \log p(\mathbf{x}),\quad s:\mathbb{R}^D\rightarrow\mathbb{R}^D\).
- The score function is the gradient of the log-density
- Prop.)
- The score forms a vector field that points toward regions of higher probability.
- Advantage)
- Tractability
- Why?) The partial function $Z$ cancels out.
- For
- $\tilde{p}(\mathbf{x})$ : an unnormalized density
- e.g.) $\exp(-E_\phi(\mathbf{x}))$ in EBMs.
- \(p(\mathbf{x}) = \frac{\tilde{p}(\mathbf{x})}{Z},\quad Z = \int \tilde{p}(\mathbf{x})\text{d}\mathbf{x}\),
- $\tilde{p}(\mathbf{x})$ : an unnormalized density
- \(\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}}\log Z}_{=0} = \nabla_{\mathbf{x}}\log \tilde{p}(\mathbf{x})\).
- For
- Why?) The partial function $Z$ cancels out.
- Complete Representation of the Underlying distribution
- \(\log p(\mathbf{x}) = \log p(\mathbf{x}_0) + \displaystyle\int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t\).
- where
- $\mathbf{x}_0$ is a reference point
- $\log p(\mathbf{x}_0)$ is fixed by normalization
- cf.)
\(\begin{aligned} \int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t &= \int_0^1 s(r(t))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t & (\text{Put } r(t) = \mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0)) \end{aligned}\) : interpolation
- where
- \(\log p(\mathbf{x}) = \log p(\mathbf{x}_0) + \displaystyle\int_0^1 s(\mathbf{x}_0 + t(\mathbf{x} - \mathbf{x}_0))^\top (\mathbf{x}-\mathbf{x}_0)\text{d}t\).
- Tractability
Model) Score Matching
Hyvärinen and Dayan, 2005
- Idea)
- Match the data and model scores!
- Loss
\(\begin{aligned} \mathcal{L}_{\text{SM}}(\phi) &= \frac{1}{2} \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left\Vert \nabla_\mathbf{x}\log p_\phi(\mathbf{x}) - \nabla_\mathbf{x}\log p_{\text{data}}(\mathbf{x}) \right\Vert_2^2 \\ &= \mathbb{E}_{p_{\text{data}(\mathbf{x})}} \left[ \text{Tr}\left(\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})\right) + \frac{1}{2} \Vert \nabla_\mathbf{x} E_\phi(\mathbf{x}) \Vert_2^2 \right] + C \\ \end{aligned}\).- where
- \(\nabla_\mathbf{x}^2 E_\phi(\mathbf{x})\) is the Hessian of $E_\phi$
- $C$ is a constant independent of $\phi$.
- Advantage)
- $Z$ is gone!
- Avoids sampling from the model during training.
- Drawback)
- Needs 2nd order derivative, which is computationally prohibitive in high dimensions.
- where
Concept) Langevin Sampling
Concept) Discrete-Time Langevin Dynamics
- Def.)
\(\begin{aligned} \mathbf{x}_{n+1} &= \mathbf{x}_{n} - \eta\nabla_{\mathbf{x}} E_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & n=0,1,\ldots \\ &= \mathbf{x}_{n} + \eta\nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n & (\because\text{By def. of the score}) \end{aligned}\).- where
- $\mathbf{x}_{0}$ is initialized from some distribution.
- $\eta\gt0$ : the step size
- $\epsilon_n\sim\mathcal{N}(\mathbf{0,I})$ : Gaussian noise
- where
- Prop.)
- $\epsilon_n$ enables exploration beyond local minima by adding stochasticity.
- The score function $\log p_\phi(\mathbf{x}_{n})$ guides the samples toward high-density regions.
Concept) Langevin Stochastic Differential Equation (SDE)
- Idea)
- $\eta\rightarrow0$ on Discrete-Time Langevin Dynamics.
- Def.)
- \(\text{d}\mathbf{x}(t) = \nabla_{\mathbf{x}} \log p_\phi(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
- Prop.)
- Discrete-time update rule serves as the Euler-Maruyama discretization of this continuous SDE.
- Under standard regularity assumptions, the distribution of $\mathbf{x}(t)$ converges to $p_\phi$ as $t\rightarrow\infty$.
- Thus, we may sample by simply solving this SDE.
- Limit)
- In high-dimensional spaces,
- its efficiency is highly sensitive to…
- step size $\eta$
- noise scale
- number of iterations
- it has the issue of poor mixing time.
- Desc.)
- Requires extremely long time to transition between regions of high probability
- The more the data distribution is mixed with many isolated modes, the longer it takes.
- Why?) Recall that it relies on the local stochastic updates to escape the local minima
- \(\sqrt{2}\text{d}\mathbf{w}(t)\).
- Why?) Recall that it relies on the local stochastic updates to escape the local minima
- Desc.)
- its efficiency is highly sensitive to…
- In high-dimensional spaces,
3.2 Score-Based Generative Model
Tech.) Parameterizing Score Function with Neural Network
- Ideation)
- Recall that the score \(s(\mathbf{x})=\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})\) is enough to reconstruct $p_{\text{data}}$.
- Then, why don’t we train our model to directly parameterize the score?
- i.e.) \(s_\phi(\mathbf{x}) \approx s(\mathbf{x})\)
- We may set up a naive MSE loss as below.
- \(\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right]\).
- However, we don’t know \(s(\mathbf{x})\).
- \(\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - s(\mathbf{x}) \Vert_2^2 \right]\).
- Hyvärinen and Dayan (2005) suggested an equivalent objective that depends only on $s_\phi(\mathbf{x})$.
- Props.)
- When we solve the \(\tilde{\mathcal{L}}_{\text{SM}}(\phi)\) minimization problem…
- the divergence term \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) favors negative value
- the norm term \(\frac{1}{2} \Vert s_\phi(\mathbf{x}) \Vert_2^2\) suppresses the score $s_\phi$
- Thus, the result stationary points act as a attractive sinks.
- i.e.) The points where the dynamics terminates!
- p66 for details on stationarity and sink.
- Thus, the result stationary points act as a attractive sinks.
- When we solve the \(\tilde{\mathcal{L}}_{\text{SM}}(\phi)\) minimization problem…
- Sampling
- Same Langevin dynamics can be used for sampling plugging in the frozen oracle score as…
- Discrete
- \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
- cf.) \(\eta s_{\phi^\times}(\mathbf{x}_{n})\) is updated in the direction that increases the density. (Opposite direction of the energy $E_\phi$ case.)
- \(\mathbf{x}_{n+1} = \mathbf{x}_{n} + \eta s_{\phi^\times}(\mathbf{x}_{n}) + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
- Continuous
- \(\text{d}\mathbf{x}(t) = s_{\phi^\times}(\mathbf{x}(t))\text{d}t + \sqrt{2}\text{d}\mathbf{w}(t)\).
- Discrete
- Same Langevin dynamics can be used for sampling plugging in the frozen oracle score as…
- Drawback)
- Training is still too costly.
- Why?)
- Calculating \(\text{Tr}\left(\nabla_\mathbf{x} s_\phi(\mathbf{x})\right)\) takes $O(D^2)$ times.
- Why?)
- Training is still too costly.
Model) Sliced Score Matching
- Idea)
- Simplify the loss by adopting the isotropic random vector and the Hutchinson’s identity $\mathbf{u}\in\mathbb{R}^D$
- Model)
- For a isotropic random vector $\mathbf{u}\in\mathbb{R}^D$
- \(\tilde{\mathcal{L}}_{\text{SM}} = \mathbb{E}_{\mathbf{x,u}} \left[ \mathbf{u}^\top(\nabla_\mathbf{x} s_\phi(\mathbf{x}))\mathbf{u} + \frac{1}{2}(\mathbf{u}^\top s_\phi(\mathbf{x}))^2 \right]\).
- Advantage)
- Cheaper computation
- No more Jacobian nor Hessian
- Simply utilize the automatic differentiation
- Jacobian- and vector-Jacobian-product operations (JVP/VJP)
- Meaning)
- Check the model’s behavior along random directions : the projected score is nudged to align with regions of higher data density, so data points become stationary in expectation.
- Meaning)
- Jacobian- and vector-Jacobian-product operations (JVP/VJP)
- Cheaper computation
- Limit)
- Its reliance on the raw data distribution makes the training fragile.
- Why?)
- for image data lying on low-dimensional manifolds, the score \(\nabla_\mathbf{x} \log p_\text{data}(\mathbf{x})\) may be undefined or unstable
- the method only constrains the vector field at observed points, providing weak control in their neighborhoods
- Why?)
- Its reliance on the raw data distribution makes the training fragile.
Concept) Isotropic Random Vector & Hutchinson’s Identity
- Def.)
- a random vector whose statistical properties are uniform in all directions
- i.e.) \(\mathbb{E}\left[ \mathbf{u} \right] = \mathbf{0},\quad \mathbb{E}\bigg[\underbrace{ \mathbf{u}\mathbf{u}^\top }_{\text{outer prod}}\bigg] = \mathbf{I}\)
- Hutchinson’s Identity)
- \(\text{Tr}(\mathbf{A}) = \mathbb{E}_{\mathbf{u}}\left[ \mathbf{u^\top A u} \right]\).
- \(\mathbb{E}_{\mathbf{u}}\left[(\mathbf{u}^\top \mathbf{v})^2\right] = \Vert \mathbf{v}\Vert_2^2\).
Pf.
$$\begin{aligned} (1) &\; \text{Tr}(\mathbf{A}) = \text{Tr}(\mathbf{AI}) = \text{Tr}(\mathbf{A}\mathbb{E}[\mathbf{uu^\top}]) = \text{Tr}(\mathbb{E}[\mathbf{u^\top A u}]) \\ (2) &\; \mathbb{E}[(\mathbf{u^\top v})^2]=\mathbb{E}[\mathbf{(u^\top v)^\top (u^\top v)}] = \mathbb{E}[\mathbf{(v^\top u)(u^\top v)}] = \mathbb{E}[\mathbf{v^\top v}] = \Vert \mathbf{v} \Vert_2^2 \\ \end{aligned}$$3.3 Denoising Score Matching
Vincent, 2011
- Idea)
- Recall we had the originally parameterized loss of \(\mathcal{L}_{\text{SM}}(\phi) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}} \left[ \Vert s_\phi(\mathbf{x}) - \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \Vert_2^2 \right]\).
- Here, the score \(\nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x})\) was intractable.
- What if we replace $p_{\text{data}}$ with marginally perturbed distribution \(p_{\sigma}(\tilde{\mathbf{x}})\)?
- \(p_{\sigma}(\tilde{\mathbf{x}}) = \displaystyle\int p_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x}) \text{d}\mathbf{x}\).
- i.e.) Injecting noise into the data $\mathbf{x}\sim p_{\text{data}}$ via a known conditional distribution \(p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x})\) with scale $\sigma$
- \(p_{\sigma}(\tilde{\mathbf{x}}) = \displaystyle\int p_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x}) \text{d}\mathbf{x}\).
- Although, \(\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})\) is intractable
- Vincent showed that conditioning on $\mathbf{x}\sim p_{\text{data}}$ yields an equivalent tractable objective : Denoising Score Matching (DSM)
- Model)
- \(\mathcal{L}_{\text{DSM}}(\phi, \sigma) := \displaystyle\frac{1}{2}\mathbb{E}_{\mathbf{x}\sim p_{\text{data}},\; \tilde{\mathbf{x}}\sim p_\sigma(\cdot\mid\mathbf{x})} \left[ \Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) \Vert_2^2 \right]\).
- Props.)
- The optimal minimizer $s^*(\tilde{\mathbf{x}};\sigma)$ satisfies
- \(s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}})\).
- \(\sigma\rightarrow0\Rightarrow\begin{cases} p_\sigma(\tilde{\mathbf{x}}) \rightarrow p_{\text{data}}(\mathbf{x}) \\ s^*(\tilde{\mathbf{x}};\sigma) = \nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}) \rightarrow \nabla_{\mathbf{x}}\log p_\text{data}(\mathbf{x}) \end{cases}\).
- \(\mathcal{L}_{\text{DSM}}\) is equivalent to \(\mathcal{L}_{\text{SM}}\).
- i.e.) \(\exists C\in\mathbb{R} \text{ s.t. } \mathcal{L}_{\text{SM}}(\phi;\sigma) = \mathcal{L}_{\text{DSM}}(\phi;\sigma) + C\)
- Denoising Score Matching is equivalent to denoising.
- \(\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}\).
- Refer to Tweedie’s Formula below.
- \(\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}\).
- The optimal minimizer $s^*(\tilde{\mathbf{x}};\sigma)$ satisfies
- Sampling
- Langevin dynamics
- \(\tilde{\mathbf{x}}_{n+1} = \tilde{\mathbf{x}}_{n} + \eta \underbrace{s_{\phi^\times}(\tilde{\mathbf{x}}_{n})}_{\approx\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}_n)} + \sqrt{2\eta}\epsilon_n,\quad \epsilon_n\sim\mathcal{N}(\mathbf{0,I})\).
- Langevin dynamics
- Advantage)
- Well-defined gradients
- Injected noise provides full support in $\mathbb{R}^D$.
- cf.) Recall that data point is a low-dimensional manifold.
- Thus, the score function is well defined everywhere.
- Injected noise provides full support in $\mathbb{R}^D$.
- Improved coverage.
- Injected noise smooths out sparse regions between modes.
- This enhances the training signal quality and facilitates the sampling.
- Well-defined gradients
- Limit)
- Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
- Low Noise Level
- Langevin dynamics struggles to traverse modes in multi-modal distributions due to vanishing gradients in low-density regions.
- High Noise Level
- The model captures only coarse structures, resulting in blurry samples.
- Low Noise Level
- Langevin dynamics can be slow to converge or even fail in high-dimensional spaces.
- Single fixed noise $\sigma$ cannot capture the dynamics on the noise level.
E.g.) Additive Gaussian Noise DSM
- Model)
- Perturbation : $\tilde{\mathbf{x}} = \mathbf{x} + \sigma\epsilon,\quad \epsilon\sim\mathcal{N}(\mathbf{0, I})$
- Then
- \(p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \mathbf{x}, \sigma^2\mathbf{I})\).
- \(\nabla_{\tilde{\mathbf{x}}}\log p_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = \displaystyle\frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2}\).
- Thus, we may further simplify the loss as…
\(\begin{aligned} \mathcal{L}_{\text{DSM}}(\phi, \sigma) &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right] \\ &= \frac{1}{2}\mathbb{E}_{\mathbf{x}, \epsilon} \left[ \left\Vert s_\phi(\mathbf{x}+\sigma\epsilon;\sigma) + \frac{\epsilon}{\sigma} \right\Vert_2^2 \right] \\ \end{aligned}\).
Concept) Tweedie’s Formula
- Goal)
- We want to show that DSM is equivalent to denoising.
- Settings)
- Denoising
- We may define denoising as $\mathbf{x}\mid\tilde{\mathbf{x}}$.
- i.e.) getting the clean data $\mathbf{x}$ given the noisy data $\tilde{\mathbf{x}}$.
- The expected value of such values can be a reasonable estimate
- \(\mathbb{E}_{\mathbf{x}\sim p(\mathbf{x}\mid\tilde{\mathbf{x}})}[\mathbf{x}\mid\tilde{\mathbf{x}}]\).
- We may define denoising as $\mathbf{x}\mid\tilde{\mathbf{x}}$.
- Denoising
- Result)
- \(\underbrace{\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}]}_{\text{denoising}} = \displaystyle\frac{1}{\alpha} \underbrace{\left(\tilde{\mathbf{x}} + \sigma^2 \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)}_{\approx\text{DSM Langevin Sampling}}\).
- Meaning)
- Optimizing the DSM score and Langevin sampling from it is a perfect denoiser.
Derivation.
- Assume that the perturbation kernel (or conditional distribution of $\tilde{\mathbf{x}}$ given $\mathbf{x}$) is Gaussian. $$\displaystyle p(\tilde{\mathbf{x}}\mid\mathbf{x}) = \mathcal{N}(\tilde{\mathbf{x}};\; \alpha\mathbf{x}, \sigma^2\mathbf{I}) \varpropto \exp\left(-\frac{\Vert\tilde{\mathbf{x}} - \alpha\mathbf{x}\Vert^2}{2\sigma^2}\right)$$ --- cf.) Other exponential family distribution works thanks to the deferentiation rule below.- Then, we may get the gradient of the prior distribution as $$\displaystyle\nabla_{\tilde{\mathbf{x}}}\; p(\tilde{\mathbf{x}}\mid \mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\;p(\tilde{\mathbf{x}}\mid\mathbf{x}) \quad\cdots\quad (A)$$ - Now, consider that DSM score can be rewritten as $$\begin{aligned} \nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} p_\sigma(\tilde{\mathbf{x}}) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \nabla_{\tilde{\mathbf{x}}} \left( \int p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \right) \\ &= \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \int \nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= \int \frac{1}{p_\sigma(\tilde{\mathbf{x}})} \underbrace{\nabla_{\tilde{\mathbf{x}}} p(\tilde{\mathbf{x}}\mid\mathbf{x})}_{(A)} p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; \frac{p(\tilde{\mathbf{x}}\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})}{p_\sigma(\tilde{\mathbf{x}})} \text{d}\mathbf{x} \\ &= -\int \frac{\tilde{\mathbf{x}} - \alpha\mathbf{x}}{\sigma^2}\; p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} & (\because\text{Bayes' Rule}) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} \int p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} - \alpha \int \mathbf{x} p(\mathbf{x}\mid\tilde{\mathbf{x}}) \text{d}\mathbf{x} \right) \\ &= - \frac{1}{\sigma^2} \left( \tilde{\mathbf{x}} - \alpha \mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] \right) \\ \end{aligned}$$ - Above can be rewritten as $$\displaystyle\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] = \frac{1}{\alpha} \left( \tilde{\mathbf{x}} + \sigma^2\nabla_{\tilde{\mathbf{x}}} \log p_\sigma(\tilde{\mathbf{x}}) \right)$$
3.4 Multi-Noise Levels of Denoising Score Matching
Model) Noise Conditional Score Networks (NCSN)
Song et al 2019
- Idea)
- Recall that DSM’s single fixed noise scale $\sigma$ was a bottleneck.
- Instead,
- inject Gaussian noise at multiple levels into the data distribution
- jointly train a noise-conditional score network (NCSN) to estimate score functions across a range of noise scales.
- Training)
- Settings)
- \(0\lt\sigma_{1}\lt\sigma_{2}\lt\cdots\lt\sigma_{L}\) : a sequence of $L$ noise levels
- $\sigma_{1}$ is small enough to preserve most of the data’s fine details
- $\sigma_{L}$ is large enough to sufficiently smooth the distribution, fascilitating easier training.
- Perturbation Kernel : \(p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) := \mathcal{N}(\mathbf{x}_\sigma;\; \mathbf{x}, \sigma^2\mathbf{I}\)
- Marginal Distribution : \(p_\sigma(\mathbf{x}_\sigma) = \displaystyle\int p_\sigma(\mathbf{x}_\sigma\mid\mathbf{x}) p_{\text{data}}(\mathbf{x})\text{d}\mathbf{x}\)
- \(0\lt\sigma_{1}\lt\sigma_{2}\lt\cdots\lt\sigma_{L}\) : a sequence of $L$ noise levels
- Loss
- \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
- where
- \(\mathcal{L}_{\text{DSM}}(\phi;\sigma_i) = \frac{1}{2} \mathbb{E}_{\mathbf{x}, \tilde{\mathbf{x}}\mid\mathbf{x}} \left[ \left\Vert s_\phi(\tilde{\mathbf{x}};\sigma) - \frac{\mathbf{x}-\tilde{\mathbf{x}}}{\sigma^2} \right\Vert_2^2 \right]\).
- $\lambda(\sigma_i)\gt0$ : a weighting function for each scale
- where
- \(\mathcal{L}_{\text{NCSN}}(\phi) := \displaystyle\sum_{i=1}^L \lambda(\sigma_i) \mathcal{L}_{\text{DSM}}(\phi;\sigma_i)\),
- Result)
- \(s^*(\mathbf{x}_\sigma, \sigma) = \nabla_{\mathbf{x}_\sigma}\log p_\sigma(\mathbf{x}_\sigma)\).
- i.e.) Optimal score for each noise level $\sigma_i$
- Prop.)
- By Tweedie’s Formula, this is equivalent to the Bayes optimal noise predictor under DDPM loss.
- Refer to x-prediction.
- By Tweedie’s Formula, this is equivalent to the Bayes optimal noise predictor under DDPM loss.
- \(s^*(\mathbf{x}_\sigma, \sigma) = \nabla_{\mathbf{x}_\sigma}\log p_\sigma(\mathbf{x}_\sigma)\).
- Settings)
- Sampling)
- Annealed Langevin Dynamics
- cf.) Step Size
- Typically scaled by the noise level as
- \(\eta_\ell = \delta\cdot\frac{\sigma_\ell^2}{\sigma_1^2}\).
- Typically scaled by the noise level as
- cf.) Step Size
- Annealed Langevin Dynamics
- Prop.)
- Slow Sampling speed.
- $O(LK)$ where $L$ is the number of noise scales and $K$ is the iterative step in each scale during the annealed Langevin Dynamics.
- Slow Sampling speed.
Analysis) NCSN vs DDPM
Enjoy Reading This Article?
Here are some more articles you might like to read next: