Density Estimation Using Real NVP
hozy Summary
- NICE was limited in expressiveness because its additive coupling layers keep the Jacobian determinant at 1 (volume-preserving), so they cannot contract or expand the data distribution flexibly.
- Real NVP proposes a multi-scale architecture that addresses this by:
- Introducing affine coupling layers (scaling and translation) to construct a non-volume-preserving transformation.
- The scaling function \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\) in each layer replaces the re-scaling from NICE.
- Performing a Squeezing operation at each scale, which trades spatial size for the number of channels.
- Utilizing checkerboard and channel-wise masking patterns to fully exploit the local correlation structure of images and increase the model’s expressiveness.
- Efficiently lowering the computational/memory cost by factoring out half of the variables directly to the latent space at regular intervals.
3 Model Definition
3.1 Change of variable formula
- Settings)
- \(x\in X\). : an observed data variable
- \(z\in Z\). : a latent variable
- where \(z\sim p_Z\).
- \(f:X\rightarrow Z\). : a bijection
- with \(g = f^{-1}\).
- Formula)
- \(p_X(x) = p_Z(f(x)) \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\).
- \(\log p_X(x) = \log p_Z(f(x)) + \log \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\).
- where \(\frac{\partial f(x)}{\partial x^\top}\). is the Jacobian of \(f\). at \(x\).
- Sampling)
- Draw \(z\sim p_Z\).
- Map it back to the data space with \(x = g(z)\).
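As a minimal sketch of the change of variable formula and the sampling step, consider a one-dimensional affine bijection with a standard normal prior. The parameters `MU` and `SIGMA` and all function names are assumptions made for this illustration, not part of the paper.

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # hypothetical parameters of the toy bijection


def f(x):
    """Data -> latent: z = (x - MU) / SIGMA."""
    return (x - MU) / SIGMA


def g(z):
    """Latent -> data, the inverse of f."""
    return z * SIGMA + MU


def log_p_z(z):
    """Log-density of the standard normal prior p_Z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def log_p_x(x):
    """log p_X(x) = log p_Z(f(x)) + log|det df/dx|, where df/dx = 1 / SIGMA."""
    log_abs_det = -math.log(SIGMA)
    return log_p_z(f(x)) + log_abs_det


# Sampling: draw z ~ p_Z, then map it to the data space with g.
rng = random.Random(0)
z = rng.gauss(0.0, 1.0)
x = g(z)
print(x, log_p_x(x))
```

In this toy case the result should agree with the closed-form log-density of a normal distribution with mean `MU` and standard deviation `SIGMA`, which makes it easy to check the formula by hand.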
3.2 Coupling Layers
Concept) Affine Coupling Layer
- Settings)
- \(d\lt D\).
- \(x\in \mathbb{R}^D\). : the input
- \(y\in \mathbb{R}^D\). : the output
- Model)
- \(y_{1:d} = x_{1:d}\).
- \(y_{d+1:D} = x_{d+1:D} \odot \exp\left( s(x_{1:d}) \right) + t\left( x_{1:d} \right)\).
- where
- \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\). : scale
- \(t : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\). : translation
- \(\odot\). : the Hadamard product
- Graphical Desc.)
3.3 Properties
Tech.) Jacobian Transformation
- Formula)
- \(\displaystyle\frac{\partial y}{\partial x^\top} = \begin{bmatrix} \mathbf{I}_d & \mathbf{0} \\ \frac{\partial y_{d+1:D}}{\partial x^\top_{1:d}} & \text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right) \end{bmatrix}\).
- where
- \(\text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right)\) : the diagonal matrix whose diagonal elements are the entries of the vector \(\exp\left[ s(x_{1:d}) \right]\).
- \(s(x_{1:d}) \in\mathbb{R}^{D-d}\).
- Determinant)
- \(\displaystyle\text{det}\left(\frac{\partial y}{\partial x^\top}\right) = \prod_{j=1}^{D-d} \exp\left[ s(x_{1:d})_j \right] = \exp\left[ \sum_{j=1}^{D-d} s(x_{1:d})_j \right]\).
- where \(s(x_{1:d})_j\) is the \(j\)-th element of the vector \(s(x_{1:d}) \in\mathbb{R}^{D-d}\).
- Invertibility)
- Forward)
\(\begin{cases} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp \left( s(x_{1:d}) \right) + t(x_{1:d}) \end{cases}\)
- Backward)
\(\begin{cases} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= (y_{d+1:D} - t(y_{1:d})) \odot \exp \left( -s(y_{1:d}) \right) \end{cases}\)
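The forward pass, the inverse, and the log-determinant of an affine coupling layer can be sketched in a few lines. The functions `s` and `t` below are arbitrary toy stand-ins (assumptions for illustration, with \(d = 2\) and \(D = 4\)); in the paper they are neural networks.

```python
import math


def s(x1):
    """Toy scale function R^d -> R^(D-d), with d = 2 and D - d = 2 (hypothetical)."""
    return [math.tanh(x1[0]), 0.5 * x1[1]]


def t(x1):
    """Toy translation function R^d -> R^(D-d) (hypothetical)."""
    return [x1[0] + x1[1], x1[0] * x1[1]]


def coupling_forward(x, d):
    """Return (y, log_det) for the affine coupling layer."""
    x1, x2 = x[:d], x[d:]
    scale, shift = s(x1), t(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, scale, shift)]
    log_det = sum(scale)  # log of prod_j exp(s_j)
    return x1 + y2, log_det


def coupling_inverse(y, d):
    """Invert the layer; s and t are only evaluated, never inverted."""
    y1, y2 = y[:d], y[d:]
    scale, shift = s(y1), t(y1)
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, scale, shift)]
    return y1 + x2


x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x, d=2)
x_rec = coupling_inverse(y, d=2)
print(y, log_det, x_rec)
```

Because the inverse only evaluates `s` and `t` on the unchanged half, these functions can be arbitrarily complex without affecting invertibility.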
3.4 Masked convolution
- Formula)
- \(y = b\odot x + (1-b) \odot \Big( x \odot \exp \big( s(b\odot x) \big) + t(b\odot x) \Big)\).
- where
- \(b = \Big[ \underbrace{1\;\cdots\;1}_{1\le j \le d} \; \underbrace{0\;\cdots\;0}_{d+1\le j \le D} \Big] \in\{0,1\}^D\). for \(j=1,\cdots,D\).
- i.e.) Channel-wise Masking
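The masked formula can be checked on a small vector. In this sketch, `s` and `t` return full-length outputs and the mask discards the entries that are not needed; that convention, and the toy functions themselves, are assumptions for illustration.

```python
import math


def s(v):
    """Toy scale network on the full-length masked input (hypothetical)."""
    total = sum(v)
    return [math.tanh(total)] * len(v)


def t(v):
    """Toy translation network on the full-length masked input (hypothetical)."""
    total = sum(v)
    return [0.1 * total] * len(v)


def masked_coupling(x, b):
    """y = b*x + (1-b) * (x * exp(s(b*x)) + t(b*x)), all products element-wise."""
    bx = [bi * xi for bi, xi in zip(b, x)]
    scale, shift = s(bx), t(bx)
    return [
        bi * xi + (1 - bi) * (xi * math.exp(si) + ti)
        for bi, xi, si, ti in zip(b, x, scale, shift)
    ]


x = [0.5, -1.0, 2.0, 3.0]
b = [1, 1, 0, 0]  # the first d = 2 entries pass through unchanged
y = masked_coupling(x, b)
print(y)
```

Entries where the mask is 1 are copied, and entries where it is 0 receive the affine update, which matches the partitioned form of the coupling layer.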
Tech.) Spatial Checkerboard Pattern
- Desc.)
- The spatial checkerboard pattern mask has value 1 where the sum of spatial coordinates is odd, and 0 otherwise.
Tech.) Channel-wise Masking
- The channel-wise mask \(b\). is 1 for the first half of the channel dimensions and 0 for the second half.
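Both masking patterns are simple to generate. The sketch below assumes zero-based spatial coordinates; the helper names are hypothetical.

```python
def checkerboard_mask(height, width):
    """1 where the sum of the spatial coordinates (i + j) is odd, else 0."""
    return [[(i + j) % 2 for j in range(width)] for i in range(height)]


def channel_mask(num_channels):
    """1 for the first half of the channels, 0 for the second half."""
    half = num_channels // 2
    return [1] * half + [0] * (num_channels - half)


for row in checkerboard_mask(4, 4):
    print(row)
print(channel_mask(4))
```

Alternating layers can use the complementary mask, obtained by replacing each entry `m` with `1 - m`.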
3.5 Combining Coupling Layers
- Why needed?)
- As the Jacobian above shows, the forward transformation leaves \(x_{1:d}\) unchanged.
- Thus, by composing coupling layers in an alternating pattern, the components left unchanged in one layer are updated in the next.
- How?)
- Let
- \(a = \{1,\cdots,d\}\).
- \(b = \{d+1,\cdots,D\}\).
- Then we may compose two coupling layers of \(f_a\). and \(f_b\). as…
- \(\displaystyle\frac{\partial(f_b\circ f_a)}{\partial x_a^\top}(x_a) = \frac{\partial f_a}{\partial x_a^\top}(x_a) \cdot \frac{\partial f_b}{\partial x_b^\top}(x_b)\).
- where
- \(x_b = f_a(x_a)\).
- cf.) Here \(x_a\) is the input of \(f_a\) and \(x_b\) is the input of \(f_b\). Each coupling layer maps \(\mathbb{R}^D \rightarrow\mathbb{R}^D\); only \(s\) and \(t\) map \(\mathbb{R}^d \rightarrow\mathbb{R}^{D-d}\).
- Inverse)
- \((f_b \circ f_a)^{-1} = f_a^{-1}\circ f_b^{-1}\).
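A minimal sketch of two alternating coupling layers follows. Since the determinant of a product of Jacobians is the product of their determinants, the log-determinants simply add, and the inverse undoes the layers in reverse order. The toy `scale_fn` and `shift_fn` and the `layer` helper are assumptions for illustration.

```python
import math


def scale_fn(u):
    """Toy scale function shared by both layers (hypothetical)."""
    return [math.tanh(v) for v in u]


def shift_fn(u):
    """Toy translation function shared by both layers (hypothetical)."""
    return [0.5 * v for v in u]


def layer(x, keep_first, inverse=False):
    """Affine coupling on a vector of even length D with d = D / 2.

    keep_first=True leaves x[:d] unchanged; keep_first=False leaves x[d:] unchanged.
    Returns (output, log-determinant of the forward map).
    """
    d = len(x) // 2
    kept, other = (x[:d], x[d:]) if keep_first else (x[d:], x[:d])
    sc, sh = scale_fn(kept), shift_fn(kept)
    if inverse:
        new = [(o - b) * math.exp(-a) for o, a, b in zip(other, sc, sh)]
    else:
        new = [o * math.exp(a) + b for o, a, b in zip(other, sc, sh)]
    out = kept + new if keep_first else new + kept
    return out, sum(sc)


x = [0.2, -0.4, 1.0, 1.5]
h, log_det_a = layer(x, keep_first=True)   # f_a updates the second half
y, log_det_b = layer(h, keep_first=False)  # f_b updates the first half
total_log_det = log_det_a + log_det_b      # determinants multiply, logs add

# Inverse of the composition: undo f_b first, then f_a.
h_rec, _ = layer(y, keep_first=False, inverse=True)
x_rec, _ = layer(h_rec, keep_first=True, inverse=True)
print(y, total_log_det, x_rec)
```

After the two layers every coordinate of the input has been transformed, which a single coupling layer cannot achieve.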
3.6 Multi-scale Architecture
- Algorithm)
- Let
- \(x \in \mathbb{R}^{s\times s}\). : the input data
- \(L\). : the number of single layers + the final layer
- i.e.)
- the \(i=1,\cdots,L-1\).-th layers are the single layers
- the \(L\).-th layer is the final layer
- Main function
- Reshape \(x\) to \(\mathbb{R}^{s\times s \times c}\) with \(c=1\).
- Initialize \(i\leftarrow0,\quad h^{(0)}\leftarrow x\in \mathbb{R}^{s\times s \times c},\quad z=\Big[\underbrace{\text{null}\cdots\text{null}}_{L}\Big]\).
- While \(s\gt 4\), repeat:
- Apply the single-scale transformation \(f^{(i+1)}\) to \(h^{(i)}\).
- Because of the squeezing operation, \(f^{(i+1)}\left( h^{(i)} \right) \in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 4c}\).
- Halve the result along the channel axis into…
- \(z^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[:2c]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\).
- \(h^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[2c:]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\).
- \(z[i+1] \leftarrow z^{(i+1)}\).
- Update \(s \leftarrow \frac{s}{2},\quad c \leftarrow 2c,\quad i \leftarrow i+1\).
- After the loop, apply the final-scale transformation \(f^{(L)}\) to \(h^{(L-1)}\) as…
- \(z^{(L)} \leftarrow f^{(L)}\left( h^{(L-1)} \right),\quad z[L] \leftarrow z^{(L)}\).
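The recursion above can be traced by following only the tensor shapes. This sketch omits the coupling layers entirely and assumes that \(s\) is a power of two; the function name is hypothetical.

```python
def multiscale_shapes(s, c):
    """Track tensor shapes through the multi-scale recursion.

    Only shapes are simulated; the coupling layers themselves are omitted.
    Returns the shapes (height, width, channels) of z^(1), ..., z^(L).
    """
    z_shapes = []
    while s > 4:
        # A single-scale block contains a squeeze: s x s x c -> s/2 x s/2 x 4c.
        s, c_out = s // 2, 4 * c
        # Factor out half of the channels as z^(i+1); the rest becomes h^(i+1).
        z_shapes.append((s, s, c_out // 2))
        c = c_out // 2
    # The final block maps the remaining h^(L-1) to z^(L) without factoring out.
    z_shapes.append((s, s, c))
    return z_shapes


shapes = multiscale_shapes(32, 3)
print(shapes)
```

For a \(32\times32\times3\) input this gives four latent blocks whose sizes sum to the input dimensionality, as required for a bijection.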
Tech.) Single-Scale Architecture
- Composition)
- 3 coupling layers of alternating checkerboard masks
- Perform a squeezing operation
- 3 coupling layers of alternating channel-wise masking
- The channel-wise masks are chosen so that the partitioning is not redundant with the previous checkerboard masking.
Tech.) Squeezing Operation
- Desc.)
- Let \(A\in\mathbb{R}^{s\times s\times c}\). be the input data format
- Then the operation modifies the dimension by
- \(s\times s\times c \rightarrow \frac{s}{2}\times\frac{s}{2}\times 4c\).
- e.g.)
- \(4\times4\times1 \rightarrow 2\times 2\times 4\).
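A minimal sketch of the squeezing operation on nested lists, using the \(4\times4\times1\) example. The order in which the four pixels of each \(2\times2\) block are stacked along the channel axis is a convention chosen here, not one taken from the paper.

```python
def squeeze(a):
    """Squeeze an s x s x c nested list into s/2 x s/2 x 4c.

    Each 2 x 2 spatial block is stacked along the channel axis. The ordering
    of the four positions inside a block is a convention chosen here.
    """
    s = len(a)
    out = []
    for i in range(0, s, 2):
        row = []
        for j in range(0, s, 2):
            # Concatenate the channels of the four pixels in the 2 x 2 block.
            row.append(a[i][j] + a[i][j + 1] + a[i + 1][j] + a[i + 1][j + 1])
        out.append(row)
    return out


# 4 x 4 x 1 input whose single channel holds the values 0..15.
a = [[[4 * i + j] for j in range(4)] for i in range(4)]
b = squeeze(a)
print(len(b), len(b[0]), len(b[0][0]))
print(b[0][0])
```

The operation only rearranges entries, so it is trivially invertible and contributes nothing to the log-determinant.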
Tech.) Final-Scale Architecture
- Composition)
- 4 coupling layers with alternating checkerboard masks
3.7 Batch Normalization
- Motivation)
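- To further improve the propagation of the training signal, the authors use deep residual networks with batch normalization and weight normalization in \(s\) and \(t\).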
- Desc.)
- Apply batch normalization to the whole coupling layer output.
- How?)
- Let \(N\) be the size of the current mini-batch.
- Calculate the estimated batch statistics \(\tilde{\mu}\) and \(\tilde{\sigma}^2\).
- e.g.) \(\displaystyle\tilde{\mu} = \frac{1}{N}\sum_{j=1}^N y^j\).
- where \(y^j = \left[ y_{1:d}^j\;\; y_{d+1:D}^j \right] = \left[ x_{1:d}^j\;\; x_{d+1:D}^j \odot \exp\left( s(x_{1:d}^j) \right) + t\left( x_{1:d}^j \right) \right]\).
- Rescale the input as \(\displaystyle x \leftarrow \frac{x-\tilde{\mu}}{\sqrt{\tilde{\sigma}^2 + \epsilon}}\).
- Then this rescaling function has a Jacobian determinant of
- \(\displaystyle\left(\prod_{i} \left( \tilde{\sigma}_i^2 + \epsilon \right) \right)^{-\frac{1}{2}}\).
- where \(i\) indexes the dimensions of \(x\).
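The rescaling and its log-determinant can be sketched as follows. The batch statistics are treated as constants when computing the Jacobian, and the value of `EPS` is an assumption chosen for illustration.

```python
import math

EPS = 1e-5  # small constant added to the variance (value assumed here)


def batch_norm_forward(batch):
    """Normalize each dimension with mini-batch statistics.

    batch: list of N vectors of length D.
    Returns (normalized batch, log|det| of the Jacobian of the rescaling).
    """
    n, dim = len(batch), len(batch[0])
    mean = [sum(v[i] for v in batch) / n for i in range(dim)]
    var = [sum((v[i] - mean[i]) ** 2 for v in batch) / n for i in range(dim)]
    normalized = [
        [(v[i] - mean[i]) / math.sqrt(var[i] + EPS) for i in range(dim)]
        for v in batch
    ]
    # log of (prod_i (var_i + eps))^(-1/2)
    log_det = -0.5 * sum(math.log(var_i + EPS) for var_i in var)
    return normalized, log_det


batch = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
normalized, log_det = batch_norm_forward(batch)
print(normalized, log_det)
```

The returned `log_det` is added to the log-determinants of the coupling layers when evaluating the log-likelihood.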
- Moving Average)
- Why needed?)
- Due to memory constraints, we may have to use small mini-batches.
- This makes training unstable.
- By normalizing with statistics accumulated over previous mini-batches rather than only the current one, we may stabilize mini-batch training.
- How?)
- Let
- \(\tilde{\mu}_t, \tilde{\sigma}_t^2\). be the layer statistics accumulated over \(1,\ldots,t-1\). mini-batches.
- \(\hat{\mu}_t, \hat{\sigma}_t^2\). be the current mini-batch statistics over \(j=1,\ldots,N\). datapoints in the \(t\).-th mini-batch
- Then the moving average statistics for the next mini-batch can be derived as
- \(\tilde{\mu}_{t+1} = \rho \tilde{\mu}_t + (1-\rho)\hat{\mu}_t\).
- \(\tilde{\sigma}_{t+1}^2 = \rho \tilde{\sigma}_t^2 + (1-\rho)\hat{\sigma}_t^2\).
- where \(\rho\) is the momentum.
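The moving-average update is a one-line recursion. In this sketch the momentum value and the toy mini-batch statistics are assumptions chosen only to make the arithmetic easy to follow.

```python
def update_running(running, current, rho):
    """One step of the moving average: rho * running + (1 - rho) * current."""
    return [rho * r + (1.0 - rho) * c for r, c in zip(running, current)]


rho = 0.9                  # momentum (value assumed for illustration)
running_mean = [0.0, 0.0]  # accumulated statistics before the first mini-batch
batch_means = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]  # toy mini-batch statistics

for current_mean in batch_means:
    running_mean = update_running(running_mean, current_mean, rho)
print(running_mean)
```

With a constant mini-batch mean, the running estimate approaches that mean geometrically at rate `rho`; the same update applies to the variance.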
4 Experiments
4.1 Procedure
Tech.) Jittering Procedure
- Objective)
- Pixel values of image data typically lie in \([0,256]^D\).
- This results in a boundary effect.
- i.e.)
- The model generates samples in an unbounded continuous space, which does not match a data distribution confined to these boundaries.
- To reconcile this issue, the authors add uniform noise to the data.
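A minimal sketch of the jittering step, assuming 8-bit integer intensities and independent Uniform\([0,1)\) noise per pixel; the helper name and the sample pixel values are hypothetical.

```python
import random


def jitter(pixels, rng):
    """Add independent Uniform[0, 1) noise to each integer pixel value."""
    return [p + rng.random() for p in pixels]


rng = random.Random(0)
pixels = [0, 17, 128, 255]  # 8-bit intensities in {0, ..., 255}
noisy = jitter(pixels, rng)
print(noisy)
```

Each jittered value stays within one unit of its original integer, so the data now occupy a continuous range rather than a discrete grid.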