Denoising Diffusion Probabilistic Models (DDPM)
Ho et al. 2020
2. Background
Concept) Diffusion Model
- First introduced in…
- Sohl-Dickstein et al. 2015, Deep unsupervised learning using nonequilibrium thermodynamics.
- Goal)
Concept) Forward Process
- Goal)
- Gradually destroy the original data \(x_0\). until it becomes pure noise.
- Def.)
- A fixed Markov chain that gradually adds Gaussian noise to data as
- \(q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) = \displaystyle\mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t I \right)\).
- where \(\beta_t\in\mathbb{R}\). is a scalar variance schedule parameter for step \(t\).
- \(q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) = \displaystyle\mathcal{N}\left(\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t I \right)\).
- A fixed Markov chain that gradually adds Gaussian noise to data as
- Props.)
- No learnable parameters
- As \(t\to T\)., the sample \(x_t \rightarrow \mathbf{x}_T\sim\mathcal{N}(0,I)\). : the pure noise!
- \(\mathbf{x}_t = \displaystyle\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\).
- where
- \(\alpha_t = 1-\beta_t\).
- \(\bar{\alpha}_t = \displaystyle\prod_{s=1}^t \alpha_s\).
- \(\epsilon\sim\mathcal{N}(0,I)\). : the Gaussian noise added to corrupt \(\mathbf{x}_0\). into \(\mathbf{x}_t\).
- where
Concept) Reverse Process
- Goal)
- Recover data from noise made by the forward process
- We may generate synthetic data using this.
- cf.) Recall inputting latent code \(\mathbf{z}\). in GAN
- Def.)
- A parameterized Markov chain that aims to invert the forward process and recover data from noise as
- \(p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))\).
- Desc.)
- The probability distribution of \(\mathbf{x}_{t-1}\mid\mathbf{x}_{t}\). is assumed to be Gaussian.
- And their moments are dependent on \(\mathbf{x}_{t}\).
- Desc.)
- \(p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))\).
- A parameterized Markov chain that aims to invert the forward process and recover data from noise as
- Props.)
- \(\theta\). : the learnable parameters
- Learned via minimizing a denoising objective:
- \(\mathbb{E}_{\mathbf{x}_0,\epsilon,t} \left[ \Vert \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \Vert^2 \right]\).
- Why?)
- Recall from forward process that
- \(\mathbf{x}_t = \displaystyle\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\).
- \(\epsilon\sim\mathcal{N}(0,I)\).
- Since \(\mathbf{x}_t\). is literally a mixture of the clean data \(\mathbf{x}_0\). and noise \(\epsilon\)., recovering \(\mathbf{x}_0\). is equivalent to recovering \(\epsilon\).
- where \(\epsilon\). is the true Gaussian noise sampled during the forward process
- Predicting \(\epsilon\). is easier and leads to a simpler loss function.
- Put \(\epsilon_\theta(\mathbf{x}_t, t)\).: the model’s prediction of that noise.
- Recall from forward process that
- Why?)
- \(\mathbb{E}_{\mathbf{x}_0,\epsilon,t} \left[ \Vert \epsilon - \epsilon_\theta(\mathbf{x}_t, t) \Vert^2 \right]\).
- Sampling starts from \(\mathbf{x}_T \sim \mathcal{N}(0,I)\). and progressively denoises until $x_0$.
Model) Diffusion Model
- \(p_\theta(\mathbf{x}_0) := \displaystyle\int p_\theta(\mathbf{x}_{0:T}) \text{d} \mathbf{x}_{1:T}\). : the latent variable model
- where
- \(\mathbf{x}_0\). : the data sample (e.g. pixel vector) s.t. \(\mathbf{x}_0\sim q(\mathbf{x}_0)\).
- where \(q\). is the ground truth distribution that data is generated from.
- \(p_\theta(\mathbf{x}_0)\). : the learned distribution
- cf.) We want \(p_\theta(x_0) \approx q(x_0)\).
- \(\mathbf{x}_1,\cdots,\mathbf{x}_T\). : the latents of the same dimensionality as the data \(\mathbf{x}_0\).
- \(p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T)\displaystyle\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\). : the reverse process
- i.e.) the joint distribution defined as a Markov chain with learned Gaussian transitions starting at \(p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T; \mathbf{0, I})\). : purely Gaussian so does not depend on \(\theta\).
- Desc.)
- Diffusion models generate data \(\mathbf{x}_0\). using the reverse Markov chain as
\(\begin{aligned} & x_T \sim \mathcal{N}(0,I) & \text{(pure Gaussian noise)} \\ \rightarrow& x_{T-1} \sim p_\theta(x_{T-1}\mid x_T) \\ \rightarrow& \cdots \\ \rightarrow& x_0 \sim p_\theta(x_0 \mid x_1) \\ \end{aligned}\).
- Diffusion models generate data \(\mathbf{x}_0\). using the reverse Markov chain as
- \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) := \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))\).
- \(\mathbf{x}_0\). : the data sample (e.g. pixel vector) s.t. \(\mathbf{x}_0\sim q(\mathbf{x}_0)\).
- where
- \(q(\mathbf{x}_{1:T}\mid\mathbf{x}_0) := \displaystyle\prod_{t=1}^T q(\mathbf{x}_t\mid \mathbf{x}_{t-1})\). : the forward process (diffusion process)
- where
- \(q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) := \displaystyle\mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I})\).
- where
- Loss Function)
- What we want is \(\log p_\theta(\mathbf{x}_0)\)..
- However, this is intractable due to the integral : \(p_\theta(\mathbf{x}_0) := \displaystyle\int p_\theta(\mathbf{x}_{0:T}) \text{d} \mathbf{x}_{1:T}\).
- Instead, just like the VAE, we may get the lower bound using the variational distribution as
\(\begin{aligned} \log p_\theta(\mathbf{x}_{0}) &= \log \int q(\mathbf{x}_{1:T}\mid \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)} \text{d}\mathbf{x}_{1:T} \\ &\ge \mathbb{E}_{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)} \left[ \log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)} \right] & \because (\text{Jensen Inequality}) \end{aligned}\). - Taking the negative, we may get the loss function as
\(\begin{array}{lll} \mathcal{L} &= \displaystyle\mathbb{E}_q \left[ -\log\frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid \mathbf{x}_0)} \right] \\ &= \displaystyle\mathbb{E}_q \left[ -\log\frac{p(\mathbf{x}_T)\displaystyle\prod_{t=1}^T p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{\displaystyle\prod_{t=1}^T q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} \right] \\ &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=1}^T \log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} \right] \\ &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] \\ \end{array}\). - Here, we want to add \(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\). term.
\(\begin{array}{lll} \mathcal{L} &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} \cdot \underbrace{\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}}_{\text{Posterior of }q} - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] \\ \end{array}\).- Why?)
- The model we want to train is \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\). in the reverse process.
- However, we don’t know what it looks like.
- Recall that we know the forward process \(q(\mathbf{x}_t\mid\mathbf{x}_{t-1})\)..
- Using the Bayes Rule, we may get
\(\begin{aligned} q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) &= \frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)} \\ &= \frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1}) q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)} \\ \end{aligned}\). - Thus, we may train our reverse process by approximating \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) \approx q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\).
- The model we want to train is \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\). in the reverse process.
- Why?)
- Plugging in the value we derived from the Bayes Rule, we have
\(\begin{array}{lll} \mathcal{L} &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \left(\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1})} \cdot\frac{\frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)}}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}\right) - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] & \because \text{Bayes Rule} \\ &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \left(\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)} \cdot\frac{q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) q(\mathbf{x}_t\mid\mathbf{x}_0)}\right) - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] \\ &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \left(\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)} \cdot\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)}\right) - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] & \because q(\mathbf{x}_t\mid \mathbf{x}_{t-1}) = q(\mathbf{x}_t\mid\mathbf{x}_{t-1},\mathbf{x}_0) \\ &= \displaystyle\mathbb{E}_q \left[ -\log p(\mathbf{x}_T) -\sum_{t=2}^T \log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)} -\log\frac{q(\mathbf{x}_{1}\mid\mathbf{x}_0)}{q(\mathbf{x}_T\mid\mathbf{x}_0)} - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)} \right] & \displaystyle\because -\sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)} = -\log\frac{q(\mathbf{x}_{1}\mid\mathbf{x}_0)}{q(\mathbf{x}_T\mid\mathbf{x}_0)} \\ &= \displaystyle\mathbb{E}_q \left[ -\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T\mid\mathbf{x}_0)} -\sum_{t=2}^T \log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)} - \log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1) \right] \\ &= \displaystyle\mathbb{E}_q \left[ \underbrace{D_{KL}({q(\mathbf{x}_T\mid\mathbf{x}_0)}\Vert{p(\mathbf{x}_T)})}_{L_T} + \sum_{t=2}^T \underbrace{D_{KL}({q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}\Vert{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)})}_{L_{t-1}} - \underbrace{\log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}_{L_0} \right] \\ \end{array}\). - Using the prop. from the forward process that \(\mathbf{x}_t \mid \mathbf{x}_0 = \displaystyle\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon\)., we have
- \(q(\mathbf{x}_t\mid \mathbf{x}_0) =\mathcal{N}\left( \mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I} \right)\).
- In the previous loss’ \(L_{t-1}\)., we may get the closed form posterior (\(\because\). Gaussian)
- \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t\mathbf{I})\). : the forward process posterior
- where
- \(\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) := \displaystyle\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t} \mathbf{x}_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\mathbf{x}_t\).
- \(\displaystyle \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t\).
- How?)
- Using the Bayes Rule
- where
- Then, since \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). and \(p_\theta (\mathbf{x}_{t-1} \mid \mathbf{x}_t)\). are both Gaussian, we may get the closed form expression of \(L_{t-1}\).
- \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t\mathbf{I})\). : the forward process posterior
3. Diffusion Models and denoising autoencoders
3.1 Forward Process and L_T
- Summary)
- Treat the approximate posterior \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). to be fixed and have no parameter.
- Desc.)
- This paper ignores that the forward process variances \(\beta_t\). are learnable by reparameterization.
- Instead, it fix them to constants.
- Thus, the approximate posterior \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\). has no parameter.
- Why?) \(\alpha_t\). is made of \(\beta_t\).
3.2 Reverse Process and L_{1:T-1}
- Goal)
- Parameterize \(p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \Sigma_\theta(\mathbf{x}_t, t))\).
- Moments)
- Variance Parameterization \((\boldsymbol{\Sigma}_\theta)\).
- Def.)
- \(\displaystyle\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}\).
- Options)
- \(\displaystyle \sigma_t^2 = \tilde{\beta}_t\). : the posterior variance of \(q\).
- Result)
- Optimal for \(\mathbf{x}_0\sim\mathcal{N}(\mathbf{0, I})\).
- Result)
- \(\displaystyle \sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_t\).
- Result)
- Optimal for \(\mathbf{x}_0\). deterministically set to one point
- Result)
- \(\displaystyle \sigma_t^2 = \tilde{\beta}_t\). : the posterior variance of \(q\).
- Def.)
- Mean Parameterization \((\mu_\theta)\).
- \(\begin{aligned} \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t, \; \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t)) \right) \\ &= \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) \end{aligned}\).
- where \(\boldsymbol{\epsilon}_\theta\). is a function approximator intended to predict \(\boldsymbol{\epsilon}\). from \(\mathbf{x}_t\).
- Derivation)
- Choosing the variance as \(\displaystyle\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}\)., we have
- \(p_\theta(\mathbf{x}_{t-1}\mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})\).
- Then, we may rewrite \(L_{t-1}\). as
- \(L_{t-1} = \displaystyle\mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \Vert \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \Vert^2 \right] + C\).
- where \(C\). is a constant that does not depend on \(\theta\).
- \(L_{t-1} = \displaystyle\mathbb{E}_q\left[ \frac{1}{2\sigma_t^2} \Vert \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \Vert^2 \right] + C\).
- Also, from \(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t\mathbf{I})\)., we may reparameterize as
- \(\displaystyle\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) = \sqrt{\bar{\alpha}} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}\). for \(\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0, I})\).
- Thus, we have
\(\begin{aligned} L_{t-1} - C &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{1}{2\sigma_t^2} \left\Vert \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), \frac{1}{\sqrt{\bar{\alpha}}_t}(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \sqrt{1-\bar{\alpha}} \boldsymbol{\epsilon}) \right) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \right\Vert^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{1}{2\sigma_t^2} \left\Vert \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), \frac{1}{\sqrt{\bar{\alpha}}_t}(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) - \sqrt{1-\bar{\alpha}} \boldsymbol{\epsilon}) \right) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) \right\Vert^2 \right] \\ \end{aligned}\).
- Choosing the variance as \(\displaystyle\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2 \mathbf{I}\)., we have
- \(\begin{aligned} \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \tilde{\boldsymbol{\mu}}_t \left( \mathbf{x}_t, \; \frac{1}{\sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t)) \right) \\ &= \frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right) \end{aligned}\).
- Variance Parameterization \((\boldsymbol{\Sigma}_\theta)\).
- Loss Function)
- Again from the above derivation, we have
\(\begin{aligned} L_{t-1} - C &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{1}{2\sigma_t^2} \left\Vert \underbrace{\frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon} \right)}_{\text{Target posterior mean}} - \underbrace{\boldsymbol{\mu}_\theta(\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}), t)}_{\text{Model mean}} \right\Vert^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{1}{2\sigma_t^2} \left\Vert \underbrace{\frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon} \right)}_{\text{Target posterior mean}} - \underbrace{\frac{1}{\sqrt{\bar{\alpha}_t}} \left( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right)}_{\text{Model mean}} \right\Vert^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{1}{2\sigma_t^2} \left\Vert \frac{\beta_t}{\sqrt{\bar{\alpha}_t}\sqrt{1-\bar{\alpha}_t}} \left( \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \boldsymbol{\epsilon} \right) \right\Vert^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \frac{\beta_t^2}{2\sigma_t^2 \bar{\alpha}_t (1-\bar{\alpha}_t)} \left\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \right\Vert^2 \right] \\ &= \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \bar{\alpha}_t (1-\bar{\alpha}_t)}}_{\text{all fixed at fwd process}} \left\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t) \right\Vert^2 \right] \\ \end{aligned}\).
- Again from the above derivation, we have
- Training)
- Algorithm)
- Uses the simplifed loss below
- Sampling)
- Algorithm)
3.3 Data scaling, reverse process decoder, and L_0
Concept) Gaussian Discretization
- Problem)
- To optimize \(L_0\)., we need to calculate the likelihood \(p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\).
- However, the given original image data \(\mathbf{x}_0\). is in the discrete format.
- Why?)
- Consider that an image data that each channel of a pixel has the value of in range \(0\sim255\)..
- Why)
- Recall that in each pixel, there are three channels, RGB.
- And each channel has the value in range \(0\sim255\)..
- Why)
- Thus, Each pixel consists of 3 channels (R, G, B), and each channel has 256 discrete states.
- Thus, its likelihood should be calculated as
- \(p_\theta(\mathbf{x}_0^i \mid \mathbf{x}_1) = \text{Softmax}(Wh^i + b)[\mathbf{x}_0^i],\quad \mathbf{x}_0^i\in\{0,1,\cdots,255\}\).
- i.e.) 256 categorical
- \(p_\theta(\mathbf{x}_0^i \mid \mathbf{x}_1) = \text{Softmax}(Wh^i + b)[\mathbf{x}_0^i],\quad \mathbf{x}_0^i\in\{0,1,\cdots,255\}\).
- Consider that an image data that each channel of a pixel has the value of in range \(0\sim255\)..
- Why?)
- If we categorize them and get probability distribution independently, …
- it’s computationally expensive (256-way softmax for each channel of each pixel!)
- Massive computations for \(D\). times
- where \(D = \text{(height)}\times\text{(width)}\times 3\).
- Massive computations for \(D\). times
- it’s hard to propagate gradients, and have high variance
- its sampling will also be discrete
- it’s computationally expensive (256-way softmax for each channel of each pixel!)
- Idea) Discretization
- Instead of predicting a categorical distribution, use a continuous Gaussian and compute the probability mass assigned to the correct bin.
- For example, if the R(red) value of the 56-th pixel is 128,
- evaluate $\displaystyle\int_{127.5}^{128.5} \mathcal{N}(x;\,\mu_\theta^i,\sigma^2) dx$.
- How?)
- In normalized form, the channel range $[-1,1]$ is partitioned into 256 bins:
- e.g.) the \(k\).-th bin would be \(\displaystyle\left[\frac{k}{256}-\frac{1}{256},\; \frac{k}{256}+\frac{1}{256}\right]\).
- Treat them as one dimensional data and get the Gaussian distribution as
- \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \displaystyle\prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} \mathcal{N}(x;\; \mu_\theta^i(\mathbf{x}_{t}, t), \sigma_t^2) \;\text{d} x\).
- where
\(\begin{aligned} \delta_{+}(x_0^i) &= \begin{cases} \infty & \text{if } x=1 \\ x+\frac{1}{255} & \text{if } x\lt 1 \end{cases} \\ \delta_{-}(x_0^i) &= \begin{cases} -\infty & \text{if } x=-1 \\ x-\frac{1}{255} & \text{if } x\gt -1 \end{cases} \\ \end{aligned}\).- \(D\). is the dimension of the data (e.g. \(\text{(height)}\times\text{(width)}\times 3\).)
- Why 3?) RGB
- \(i\). indicates extraction of one coordinate.
- \(D\). is the dimension of the data (e.g. \(\text{(height)}\times\text{(width)}\times 3\).)
- where
- Interpretation)
- The range \(\left[\delta_{-}(x_0^i), \delta_{+}(x_0^i)\right]\). corresponds with the discrete value (e.g. \(128\in[0,255]\).)
- \(\displaystyle\int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} \mathcal{N}(x;\; \mu_\theta^i(\mathbf{x}_{t}, t), \sigma_t^2) \;\text{d} x\). is the probability mass in that range.
- Think of the value as the probability that the Gaussian exists in that range.
- And, that Gaussian is determined by the moments provided from the previous step \(\mu_\theta(\mathbf{x}_t,t)\).
- It’s just a trick to fit discrete value in to continuous setting.
- \(p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) = \displaystyle\prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} \mathcal{N}(x;\; \mu_\theta^i(\mathbf{x}_{t}, t), \sigma_t^2) \;\text{d} x\).
- In normalized form, the channel range $[-1,1]$ is partitioned into 256 bins:
- Prop.)
- Similar to VAE decoders, and AR models
- This distribution ensures that the variational bound (ELBO) is a lossless codelength of discrete data.
- Why codelength?)
- Recall that MLE \(\left(\displaystyle\arg\min_\theta -\log p(\theta)\right)\). is finding \(\theta\). that minimizes the code length in information theory.
- Consider that we modified the discrete data into the continuous one.
- Thus, there can be a chance that the discrete data is lost.
- However, the discretized likelihood guarantees that no data is lost and the value that ELBO is the actual lossless code length.
- Why codelength?)
- No need to
- add noise to the data
- incorporate the Jacobian of the scaling operation into the log likelihood
- This distribution ensures that the variational bound (ELBO) is a lossless codelength of discrete data.
- Similar to VAE decoders, and AR models
Concept) Reverse Process Decoder
- Starting point
- \(p(\mathbf{x}_T) \sim \mathcal{N}(\mathbf{0,I})\). : the standard normal prior
- Last part
- \(p_\theta(\mathbf{x}_{0}\mid\mathbf{x}_1) = \displaystyle\prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} \mathcal{N}(x;\; \mu_\theta^i(\mathbf{x}_{1}, 1), \sigma_1^2) \;\text{d} x\).
- Prop.)
- At the end of sampling, it displays \(\mu_\theta(\mathbf{x}_1, 1)\). noiselessly.
3.4 Simplified training object
- Def.)
- \(L_{\text{simple}}(\theta) := \displaystyle\mathbb{E}_{t,\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t) \right\Vert^2 \right]\).
- where
- \(t\sim\text{Uniform}[1,T]\).
- where
- \(L_{\text{simple}}(\theta) := \displaystyle\mathbb{E}_{t,\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \left\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t) \right\Vert^2 \right]\).
- Props.)
- \(t=1\).
- \(L_0\). with the integral in the discrete decoder definition (last part) approximated by the Gaussian pdf times the bin width, ignoring \(\sigma_1^2\). and edge effects.
- \(t\gt1\).
- Unweighted version of
- \(\mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}} \left[ \underbrace{\frac{\beta_t^2}{2\sigma_t^2 \bar{\alpha}_t (1-\bar{\alpha}_t)}}_{\text{all fixed at fwd process}} \left\Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \boldsymbol{\epsilon}, t) \right\Vert^2 \right]\).
- Unweighted version of
- \(t=1\).
4. Experiments
- Settings)
- \(T = 1000\).
- Linear noise schedule \(\beta_t\).
- \(\beta_1 = 10^{-4}\).
- \(\beta_T = 0.02\).
Implementation
- Network Model
- Input
- \(\mathbf{x}_t\). : the forwarded image
- Recall that \(\beta_t\). is given and \(\epsilon\sim\mathcal{N}(\mathbf{0,I})\). is sampled.
- Then, we may generate \(\mathbf{x}_t\). using them.
- \(t\). : the time stamp
- This will go thought the embedding such as sinusoidal embedding.
- Why?)
- To inject the integer \(t\). into the CNN such as UNET
- Why?)
- This will go thought the embedding such as sinusoidal embedding.
- \(\mathbf{x}_t\). : the forwarded image
- Input
Model)
- e.g.) UNET
- Use convolution to get the global features.
- Upsample and mix with the cropped copy
- Output
- \(\epsilon_\theta(\mathbf{x}_t, t)\).
- Loss
- The difference between \(\epsilon\). and \(\epsilon_\theta(\mathbf{x}_t, t)\).
- Training
- Minimize the loss using the backprop.
- Sampling
- Input the complete noise \(\mathbf{x}_T\)..
- Then the reverse process will generate the image.
Enjoy Reading This Article?
Here are some more articles you might like to read next: