Normalizing Flows are Capable of Generative Models (Tarflow)
2 Method
2.1 Normalizing Flows
- Settings)
- \(x\in\mathbb{R}^D\). : the input dataset
- where
- \(x\sim p_{\text{data}}\).
- where
- \(p_0\). : the prior density
- This paper uses \(p_0 = \mathcal{N}(0, \mathbf{I}_D)\).
- \(x\in\mathbb{R}^D\). : the input dataset
- Model)
- \(p_{\text{model}}\). : the model density
- i.e.) \(\displaystyle p_{\text{model}} = p_0(f(x)) \; \left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\).
- where
- \(f:\mathbb{R}^D\rightarrow\mathbb{R}^D\). : the transformation
- where
- i.e.) \(\displaystyle p_{\text{model}} = p_0(f(x)) \; \left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\).
- \(p_{\text{model}}\). : the model density
- Optimization)
- MLE objective of
- \(\displaystyle\min_f - \log p_0(f(x)) - \log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\).
- With the Gaussian prior \(p_0\)., we may rewrite as
- \(\displaystyle\min_f \log \frac{1}{2}\Vert f(x) \Vert_2^2 - \log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\).
- Applying the AR flow transformation below, we may get
- \(\displaystyle\min_f \frac{1}{2}\Vert z^T \Vert_2^2 + \sum_{i=1}^{N-1} \sum_{j=1}^{D-1} \alpha_i^t(\tilde{z}_{\lt i}^t)\).
- MLE objective of
- Result)
- A generative model of
- \(x = f^{-1}(z)\).
- where \(z=f(x)\sim p_0(z)\). is the latent variable
- \(x = f^{-1}(z)\).
- A generative model of
2.2 Block Autoregressive Flows
- Desc.)
- Stacking multiple layers of AR flows
- cf.)
- IAF (Kingma et al. 2016)
- MAF (Papamakarios et al. 2017)
- cf.)
- Stacking multiple layers of AR flows
- Settings)
- \(x\in\mathbb{R}^{N\times D}\). : the input sequence data with length \(N\). and dimension \(D\).
- \(i\). : the subscript for indexing along the sequential dimension \(\mathbb{R}^D\).
- i.e.) \(x_i\in\mathbb{R}^D,\quad i=1,2,\ldots,N\).
- \(i\). : the subscript for indexing along the sequential dimension \(\mathbb{R}^D\).
- \(T\in\mathbb{N}\). : the number of flow layers in the stack of flows
- \(\{\pi^t\}\;(t=1,\ldots,T)\). : any fixed set of permutation functions along the sequence dimension \(\mathbb{R}^D\).
- cf.) In this paper,
- \(\pi^t(z)_i = \begin{cases} z_{N-1-i} & (\text{Reverse function}) & t\gt0 \\ z_i & (\text{Identity funcition}) & t=0 \end{cases}\).
- cf.) In this paper,
- \(x\in\mathbb{R}^{N\times D}\). : the input sequence data with length \(N\). and dimension \(D\).
Model) Flow Transformation
- Def.)
- \(z^T = f(x) := (f^{T-1}\circ f^{T-2}\circ\cdots f^{0})(x)\).
- where
- \(f^t\). is parameterized by two learnable functions \(\mu^t, \alpha^t : \mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{N\times D}\).
- cf.) Both causal along the sequence dimension
- i.e.) Causality : Refer to \(\mu_i^t(\tilde{z}_{\lt i}^t)\). and \(\alpha_i^t(\tilde{z}_{\lt i}^t)\). below.
- cf.) Both causal along the sequence dimension
- \(f^t\). is parameterized by two learnable functions \(\mu^t, \alpha^t : \mathbb{R}^{N\times D}\rightarrow\mathbb{R}^{N\times D}\).
- Dynamics
- Input
- \(z^t = \{z_i^t\}_{i\in\{1,\ldots,N\}} \in\mathbb{R}^{N\times D}\).
- Output
- \(z^{t+1} = \{z_i^{t+1}\}_{i\in\{1,\ldots,N\}} \in\mathbb{R}^{N\times D}\).
- where
- \(z_i^{t+1} = \begin{cases} \tilde{z}_i^t & i=0 \\ (\tilde{z}_i^t - \mu_i^t(\tilde{z}_{\lt i}^t)) \odot \exp(-\alpha_i^t(\tilde{z}_{\lt i}^t)) & i\gt0 \end{cases}\).
- \(\tilde{z}^t = \pi^t(z^t)\).
- where
- \(z^{t+1} = \{z_i^{t+1}\}_{i\in\{1,\ldots,N\}} \in\mathbb{R}^{N\times D}\).
- Input
- where
- \(z^T = f(x) := (f^{T-1}\circ f^{T-2}\circ\cdots f^{0})(x)\).
- Prop.)
- Causality of \(\mu^t\). and \(\alpha^t\).
- \(\mu_i^t(\tilde{z}_{\lt i}^t)\).
- \(\alpha_i^t(\tilde{z}_{\lt i}^t)\).
- Inverse
- Input
- \(z^{t+1}\).
- Output
- \(z^t\).
- where
- \(z^t = (\pi^t)^{-1}(\tilde{z}^t)\).
- \(\tilde{z}_i^{t} = \begin{cases} z_i^{t+1} & i=0 \\ (z_i^{t+1}) \odot \exp(\alpha_i^t(\tilde{z}_{\lt i}^t)) + \mu_i^t(\tilde{z}_{\lt i}^t) & i\gt0 \end{cases}\).
- where
- \(z^t\).
- Input
- \(D\). plays a role of balancing the difficulty of modeling each position in the sequence and the length of the entire sequence.
- why?)
- Consider that \(N\times D\). is fixed.
- e.g.) \(64\times 64\).-sized pixel image
- Putting the patch size of \(8\times 8 = D\). pixels, we may get the \(8^2=64\). length of sequence.
- e.g.) \(64\times 64\).-sized pixel image
- Consider that \(N\times D\). is fixed.
- why?)
- Jacobian Determinant Calculation
- Since we consider the maximum log-likelihood, we consider the log determinant.
- \(\pi^t\). has 0 log determinant.
- why?)
- \(\pi^t\). is volume preserving \(\Rightarrow\). det is 1 \(\Rightarrow\). log-det is 0.
- why?)
- The AR step has the log Jacobian Determinant of…
- \(\displaystyle \log \left\vert \text{det}\left(\frac{\partial f^t(z^t)}{\partial z^t}\right) \right\vert = -\sum_{i=1}^{N-1} \sum_{j=1}^{D-1} \alpha_i^t(\tilde{z}_{\lt i}^t)\).
- Prop.) Sum of linear terms
- \(\displaystyle \log \left\vert \text{det}\left(\frac{\partial f^t(z^t)}{\partial z^t}\right) \right\vert = -\sum_{i=1}^{N-1} \sum_{j=1}^{D-1} \alpha_i^t(\tilde{z}_{\lt i}^t)\).
- Then, we may get the loss of…
\(\begin{aligned} \min_f &- \log \frac{1}{2}\Vert f(x) \Vert_2^2 - \log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert \\ &= \frac{1}{2}\Vert z^T \Vert_2^2 + \log \left\vert \text{det}\left(\frac{\partial f^t(z^t)}{\partial z^t}\right) \right\vert + \sum_{i=1}^{N-1} \sum_{j=1}^{D-1} \alpha_i^t(\tilde{z}_{\lt i}^t) \end{aligned}\).
- Causality of \(\mu^t\). and \(\alpha^t\).
2.3 Transformer Autoregressive Flows
Concept) Parallel Implementation
- Desc.)
- Following structure enables the parallel implementation.
- \(z^{t+1} = \{z_i^{t+1}\}_{i\in\{1,\ldots,N\}} \in\mathbb{R}^{N\times D}\).
- where
- \(z_i^{t+1} = \begin{cases} \tilde{z}_i^t & i=0 \\ (\tilde{z}_i^t - \mu_i^t(\tilde{z}_{\lt i}^t)) \odot \exp(-\alpha_i^t(\tilde{z}_{\lt i}^t)) & i\gt0 \end{cases}\).
- \(\tilde{z}^t = \pi^t(z^t)\).
- where
- \(z^{t+1} = \{z_i^{t+1}\}_{i\in\{1,\ldots,N\}} \in\mathbb{R}^{N\times D}\).
- Transformer backbone
- Following structure enables the parallel implementation.
- e.g.)
- Refer to the image case below
Tech.) Dimensionality and Modality
- Desc.)
- Recall that we had two dimensions
- \(N\). : the sequence length
- \(D\). : the dimension of a block that forms the sequence
- Recall that we had two dimensions
- Application)
- Image
- Initial Dimension)
- \(C\times W\times H\).
- Setting)
- Let \(S\). be the patch length.
- i.e.) \(S\times S\). is the patch size
- Let \(S\). be the patch length.
- Conversion)
- \(N = \displaystyle\frac{HW}{S^2}\).
- \(D = CS^2\).
- Advantage)
- Readily applicable to the Vision Transformer (ViT)
- Apply causality mask to transformation of a single AR pass \(f^t\).
- Initial Dimension)
- Image
Analysis) Train Stability
- Arg.)
- Tarflow enables stable training.
- Why?)
- Tarflow is a variant of a Residual Network (ResNet).
- Two types of ResNets
- Hidden layers inside the causal transformer.
- cf.) Recall the residual connection between the self-attention and the MLP layer
- Latents \(z_i^t\).
- why?)
- Recall that the forward dynamics goes
- \(z_i^{t+1} = \begin{cases} \tilde{z}_i^t & i=0 \\ (\tilde{z}_i^t - \mu_i^t(\tilde{z}_{\lt i}^t)) \odot \exp(-\alpha_i^t(\tilde{z}_{\lt i}^t)) & i\gt0 \end{cases}\).
- Here, the permutation \(\pi\). is volume preserving and does not destroy the residual connection.
- \(\mu_i^t(\tilde{z}_{\lt i}^t)\). and \(\alpha_i^t(\tilde{z}_{\lt i}^t)\). are the affine transformation coefficients of \(\tilde{z}_i^t\).
- Recall that the forward dynamics goes
- why?)
- Hidden layers inside the causal transformer.
2.4 Noise Augmented Training
- Desc.)
- Recall that real NVP added the Uniform noise to the data.
- This is equivalent to the dequantizing the discrete pixel distribution to a continuous Uniform distribution.
- Notation)
- Let \(p_{\text{data}}\). be the actual data distribution.
- Then, we model a noise augmented distribution \(\displaystyle q(y) = \int_\epsilon p_{\text{data}}(y-\epsilon) p_\epsilon(\epsilon) \text{d}\epsilon\).
- cf.) Finite data case : \(\displaystyle q(y) = \frac{1}{\vert\mathcal{X}\vert} \sum_{x\in\mathcal{X}} p_\epsilon(y-x)\).
- We may get the likelihood as
- \(\displaystyle\tilde{p}(\tilde{x}) = \int_{\epsilon\in[0,\text{bin}]^D}\;p_{\text{model}}(\tilde{x}+\epsilon)\text{d}\epsilon\).
- where \(\text{bin}\).
- \(\displaystyle\tilde{p}(\tilde{x}) = \int_{\epsilon\in[0,\text{bin}]^D}\;p_{\text{model}}(\tilde{x}+\epsilon)\text{d}\epsilon\).
- Why needed?)
- Input data is in the discrete value.
- Thus, the model learns the transformation between the discrete input distribution and the prior distribution.
- Hence, the generation by the inverse model shows poor quality.
- To the model, continuous data space is the out of distribution!
- Meanwhile, adding noise to the data can enrich the support of the training distribution.
- Here, the authors add Gaussian noise instead of the Uniform noise.
- i.e.) \(\mathcal{N}(\cdot\;;\; 0, \sigma^2 \mathbf{I})\).
- e.g.) For the image pixel with values \([-1,1]\)., an optimal \(\sigma\). of \(p_\epsilon(\cdot)\). for sample quantity is around \(0.05\)..
- Why?)
- Gaussian has denser input distribution
- probability density throughout \((-\infty, \infty)\).
- Gaussian has denser input distribution
- Recall that real NVP added the Uniform noise to the data.
- Problem)
- Model trained on the noisy distribution \(q(y)\). naturally generates outputs that mimic noisy training examples.
- Thus, the samples are less visually appealing.
- cf.) Real NVP injected relatively small noise. Thus, the generated sample is not that much noisy but the OOD problem is not properly handled.
- Sol.)
- Score Based Denoising below
2.5 Score Based Denoising
- Motivation)
- Samples generated from the noise augmented training is too noisy.
- Idea)
- Consider the joint distribution \((x,y)\). s.t.
- \(x\sim p_{\text{data}}\). : original data
- \(y = x + \epsilon\). for \(\epsilon\sim\mathcal{N}(0, \sigma^2\mathbf{I})\). : noise injected data
- with distribution \(y\sim q(y)\).
- Then, by definition, \(y\). is marginally distributed as the noisy data distribution \(q\).
- where \(\displaystyle q(y) = \int_\epsilon p_{\text{data}}(y-\epsilon) p_\epsilon(\epsilon) \text{d}\epsilon\).
- By Tweedie’s formula, we have
- \(\mathbb{E}[x\mid y] = y + \sigma^2\nabla_y \log q(y)\).
- Desc.)
- Recall that the Gaussian distribution has the highest density at its origin ($\epsilon=0$).
- Thus, $\nabla_y \log q(y)$ (the score) acts as a compass pointing toward the density center
- i.e.) the “origin” where the noise is zero.
- Crucially, this “origin” in the noise space corresponds to the true data location $x$ in the data space ($x = y - \epsilon$).
- Therefore, the score $\nabla_y \log q(y)$ shows the direction toward the clean data manifold $p_{data}$.
- Magnitude & Velocity)
- The variance $\sigma^2$ functions as the magnitude (or step size) required to traverse back to the origin.
- Hence, the term $\sigma^2 \nabla_y \log q(y)$ can be interpreted as the correction velocity that shifts the noisy sample $y$ back into the clean sample $x$.
- \(y + \sigma^2 \nabla_y \log q(y) \approx \mathbb{E}[x\mid y]\).
- Desc.)
- \(\mathbb{E}[x\mid y] = y + \sigma^2\nabla_y \log q(y)\).
- Thus, if we know \(\nabla_y \log q(y)\)., the gradient of log likelihood \(\log q(y)\)., we may denoise into
- \(\hat{x}:=\mathbb{E}[x\mid y]\).
- Now, suppose our model is well trained and achieved the likelihood of \(p_{\text{model}}(y)\)..
- Then, we might replace \(q(y)\). with \(p_{\text{model}}(y)\)..
- Consider the joint distribution \((x,y)\). s.t.
- Procedure)
- Draw latent from the prior.
- \(z\sim p_0\)..
- Generate a sample using the inverse transformation.
- \(y := f^{-1}(z)\).
- Score Based Denoising
- \(x := y + \sigma^2\nabla_y \log p_{\text{model}}(y)\).
- Draw latent from the prior.
2.6 Guidance
- Method) Classifier Free Guidance (CFG)
- Settings)
- Class conditional predictions
- \(\mu_i^t(\cdot; c), \alpha_i^t(\cdot; c)\).
- where \(c\). is the condition
- \(\mu_i^t(\cdot; c), \alpha_i^t(\cdot; c)\).
- Unconditional predictions
- \(\mu_i^t(\cdot; \emptyset), \alpha_i^t(\cdot; \emptyset)\).
- Implementation)
- Randomly drop out the class label during training
- Implementation)
- \(\mu_i^t(\cdot; \emptyset), \alpha_i^t(\cdot; \emptyset)\).
- Class conditional predictions
- Modify the reverse function into
- \(\tilde{z}_i^t = \tilde{z}_i^{t+1} \odot \exp\left( \tilde{\alpha}_i^t(\tilde{z}_{\lt i}^t; c, w) \right) + \tilde{\mu}_i^t(\tilde{z}_{\lt i}^t; c, w)\)..
- where
- \(\tilde{\mu}_i^t(\tilde{z}_{\lt i}^t; c, w) = (1+w){\mu}_i^t(\tilde{z}_{\lt i}^t; c, w) - w {\mu}_i^t(\tilde{z}_{\lt i}^t; \emptyset)\).
- \(\tilde{\alpha}_i^t(\tilde{z}_{\lt i}^t; c, w) = (1+w){\alpha}_i^t(\tilde{z}_{\lt i}^t; c, w) - w {\alpha}_i^t(\tilde{z}_{\lt i}^t; \emptyset)\).
- where
- \(\tilde{z}_i^t = \tilde{z}_i^{t+1} \odot \exp\left( \tilde{\alpha}_i^t(\tilde{z}_{\lt i}^t; c, w) \right) + \tilde{\mu}_i^t(\tilde{z}_{\lt i}^t; c, w)\)..
- Settings)
3 Experiments
Enjoy Reading This Article?
Here are some more articles you might like to read next: