Density Estimation Using Real NVP

hozy Summary

NICE, the predecessor of Real NVP, was limited in expressiveness: its additive coupling layers keep the Jacobian determinant at 1 (volume-preserving), so the transformation cannot contract or expand regions of the data distribution.
Real NVP addresses this limitation, within a multi-scale architecture, by:
- Introducing Affine coupling layers (scaling and translation) to construct a Non-Volume Preserving transformation.
  - The scaling function \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\) in each layer replaces the re-scaling used in NICE.
- Performing a Squeezing operation at each scale, which trades spatial size for the number of channels.
- Utilizing checkerboard and channel-wise masking patterns to fully exploit the local correlation structure of images and increase the model’s expressiveness.
- Efficiently lowering the computational/memory cost by factoring out half of the variables directly to the latent space at regular intervals.

3 Model Definition

3.1 Change of variable formula

Settings)
- \(x\in X\) : an observed data variable
- \(z\in Z\) : a latent variable
  - where \(z\sim p_Z\)
- \(f:X\rightarrow Z\) : a bijection
  - with \(g = f^{-1}\)
Formula)
- \(p_X(x) = p_Z(f(x)) \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\)
- \(\log p_X(x) = \log p_Z(f(x)) + \log \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\)
  - where \(\frac{\partial f(x)}{\partial x^\top}\) is the Jacobian of \(f\) at \(x\)
Sampling)
- Draw \(z\sim p_Z\)
- Map it back to the original (image) space: \(x = g(z)\)
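
The change of variable formula and the sampling step can be checked on a toy case. The sketch below assumes a 1-D affine bijection with hypothetical parameters `MU` and `SIGMA` (not from the paper) and a standard normal prior; under these assumptions \(p_X\) should be the density of \(\mathcal{N}(\text{MU}, \text{SIGMA}^2)\).

```python
import math
import random

MU, SIGMA = 2.0, 0.5  # hypothetical parameters of the toy bijection

def f(x):
    """Data -> latent: a 1-D affine bijection."""
    return (x - MU) / SIGMA

def g(z):
    """Latent -> data: the inverse of f."""
    return z * SIGMA + MU

def log_p_z(z):
    """Log-density of the standard normal prior p_Z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def log_p_x(x):
    """Change of variables: log p_Z(f(x)) + log|df/dx|."""
    log_abs_det = -math.log(SIGMA)  # df/dx = 1 / SIGMA
    return log_p_z(f(x)) + log_abs_det

# Sampling: draw z ~ p_Z, then map it to data space with g.
rng = random.Random(0)
samples = [g(rng.gauss(0.0, 1.0)) for _ in range(5)]
```

Here the log-determinant term is the constant \(-\log \text{SIGMA}\); in Real NVP it depends on \(x\) through the scale function.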

3.2 Coupling Layers

Concept) Affine Coupling Layer

Settings)
- \(d\lt D\)
- \(x\in \mathbb{R}^D\) : the input
- \(y\in \mathbb{R}^D\) : the output
Model)
- \(y_{1:d} = x_{1:d}\)
- \(y_{d+1:D} = x_{d+1:D} \odot \exp\left( s(x_{1:d}) \right) + t\left( x_{1:d} \right)\)
  - where
    - \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\) : scale
    - \(t : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\) : translation
    - \(\odot\) : the Hadamard product
Graphical Desc.)

3.3 Properties

Tech.) Jacobian Transformation

Formula)
- \(\displaystyle\frac{\partial y}{\partial x^\top} = \begin{bmatrix} \mathbf{I}_d & \mathbf{0} \\ \frac{\partial y_{d+1:D}}{\partial x^\top_{1:d}} & \text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right) \end{bmatrix}\)
  - where
    - \(\text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right)\) : the diagonal matrix whose diagonal elements are the entries of the vector \(\exp\left[ s(x_{1:d}) \right]\)
    - \(s(x_{1:d}) \in\mathbb{R}^{D-d}\)
Determinant)
- Since the Jacobian is lower triangular, its determinant is the product of its diagonal entries.
- \(\displaystyle\text{det}\left(\frac{\partial y}{\partial x^\top}\right) = \prod_{j=1}^{D-d} \exp\left[ s(x_{1:d})_j \right] = \exp\left[ \sum_{j=1}^{D-d} s(x_{1:d})_j \right]\)
  - where \(s(x_{1:d})_j\) is the \(j\)-th element of the vector \(s(x_{1:d}) \in\mathbb{R}^{D-d}\)
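
The determinant formula can be sanity-checked numerically. The sketch below assumes \(D=4\), \(d=2\) and hypothetical toy functions `s` and `t` (the paper uses neural networks); it compares the sum of the scale outputs with the log-determinant of a finite-difference Jacobian.

```python
import math

def s(x1):
    """Hypothetical scale function R^2 -> R^2 (stand-in for a network)."""
    return [math.tanh(x1[0] + x1[1]), 0.5 * x1[0]]

def t(x1):
    """Hypothetical translation function R^2 -> R^2."""
    return [x1[0] * x1[1], x1[1] - 1.0]

def coupling_forward(x, d=2):
    """Affine coupling layer; returns (y, log|det J|)."""
    x1, x2 = x[:d], x[d:]
    sv, tv = s(x1), t(x1)
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, sv, tv)]
    return x1 + y2, sum(sv)

def numerical_jacobian(x, eps=1e-6):
    """Central finite-difference Jacobian dy/dx."""
    D = len(x)
    J = [[0.0] * D for _ in range(D)]
    for k in range(D):
        xp, xm = list(x), list(x)
        xp[k] += eps
        xm[k] -= eps
        yp, _ = coupling_forward(xp)
        ym, _ = coupling_forward(xm)
        for i in range(D):
            J[i][k] = (yp[i] - ym[i]) / (2.0 * eps)
    return J

def det(M):
    """Determinant by Laplace expansion (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    total = 0.0
    for k, a in enumerate(M[0]):
        minor = [row[:k] + row[k + 1:] for row in M[1:]]
        total += (-1) ** k * a * det(minor)
    return total

x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
numeric_log_det = math.log(abs(det(numerical_jacobian(x))))
```

The closed form needs only the output of `s`, whereas the generic route needs the full \(D\times D\) Jacobian and a determinant.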
Invertibility)
- Forward)
  \(\begin{cases} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp \left( s(x_{1:d}) \right) + t(x_{1:d}) \\ \end{cases}\)
- Backward)
  \(\begin{cases} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= (y_{d+1:D} - t(y_{1:d})) \odot \exp \left( -s(y_{1:d}) \right) \\ \end{cases}\)
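
A minimal round-trip sketch of the forward and backward passes, assuming hypothetical toy functions `s` and `t` that are deliberately not invertible. The inverse only evaluates them on the unchanged half \(y_{1:d} = x_{1:d}\), so it never has to invert them.

```python
import math

def s(x1):
    """Hypothetical, non-invertible scale function R^2 -> R^2."""
    return [math.sin(x1[0]), x1[1] ** 2]

def t(x1):
    """Hypothetical, non-invertible translation function R^2 -> R^2."""
    return [abs(x1[0] - x1[1]), math.cos(x1[1])]

def coupling_forward(x, d=2):
    x1, x2 = x[:d], x[d:]
    y2 = [xi * math.exp(si) + ti for xi, si, ti in zip(x2, s(x1), t(x1))]
    return x1 + y2

def coupling_inverse(y, d=2):
    y1, y2 = y[:d], y[d:]
    # y1 equals x1, so s and t are evaluated on the same input as in the forward pass.
    x2 = [(yi - ti) * math.exp(-si) for yi, si, ti in zip(y2, s(y1), t(y1))]
    return y1 + x2

x = [0.3, -1.2, 0.7, 2.0]
x_rec = coupling_inverse(coupling_forward(x))
```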

3.4 Masked convolution

Formula)
- \(y = b\odot x + (1-b) \odot \Big( x \odot \exp \big( s(b\odot x) \big) + t(b\odot x) \Big)\)
  - where
    - \(b = \Big[ \underbrace{1\;\cdots\;1}_{1\le j \le d} \; \underbrace{0\;\cdots\;0}_{d+1\le j \le D} \Big] \in\{0,1\}^D\) for \(j=1,\cdots,D\)
      - i.e.) Channel-wise Masking
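
The masked formula can be sketched on a flat vector. This assumes hypothetical toy functions `s` and `t` that take the masked full-size input \(b\odot x\) and return full-size outputs; entries at positions where \(b=1\) are discarded by the \((1-b)\) factor.

```python
import math

def s(v):
    """Hypothetical scale function on the masked input, R^D -> R^D."""
    return [0.1 * sum(v)] * len(v)

def t(v):
    """Hypothetical translation function on the masked input, R^D -> R^D."""
    return [0.5 * v[0]] * len(v)

def masked_coupling(x, b):
    """y = b*x + (1-b) * (x * exp(s(b*x)) + t(b*x)), elementwise."""
    bx = [bi * xi for bi, xi in zip(b, x)]
    sv, tv = s(bx), t(bx)
    return [bi * xi + (1 - bi) * (xi * math.exp(si) + ti)
            for bi, xi, si, ti in zip(b, x, sv, tv)]

x = [1.0, 2.0, 3.0, 4.0]
b = [1, 1, 0, 0]  # first d = 2 entries pass through unchanged
y = masked_coupling(x, b)
```

With this mask the result matches the partitioned form: `y[:2]` equals `x[:2]`, and the remaining entries are scaled and translated.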

Tech.) Spatial Checkerboard Pattern

Desc.)
- The spatial checkerboard pattern mask has value 1 where the sum of spatial coordinates is odd, and 0 otherwise.

Tech.) Channel-wise Masking

The channel-wise mask \(b\) is 1 for the first half of the channel dimensions and 0 for the second half.
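
Both mask patterns are simple to construct. The sketch below assumes zero-based spatial coordinates (the notes do not fix an indexing convention), so the top-left entry of the checkerboard is 0.

```python
def checkerboard_mask(height, width):
    """1 where the sum of the (zero-based) spatial coordinates is odd, else 0."""
    return [[(i + j) % 2 for j in range(width)] for i in range(height)]

def channel_mask(channels):
    """1 for the first half of the channels, 0 for the second half."""
    half = channels // 2
    return [1] * half + [0] * (channels - half)

def flip(mask_row):
    """Complementary mask, used when alternating between coupling layers."""
    return [1 - m for m in mask_row]

board = checkerboard_mask(4, 4)
chan = channel_mask(4)
```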

3.5 Combining Coupling Layers

Why needed?)
- As we can see from the Jacobian Transformation, a single coupling layer leaves \(x_{1:d}\) unchanged.
- Thus, by composing layers in an alternating pattern, the components left unchanged in one layer are updated in the next.
How?)
- Let
  - \(a = \{1,\cdots,d\}\)
  - \(b = \{d+1,\cdots,D\}\)
- Then we may compose two coupling layers of \(f_a\) and \(f_b\) as…
  - \(\displaystyle\frac{\partial(f_b\circ f_a)}{\partial x_a^\top}(x_a) = \frac{\partial f_a}{\partial x_a^\top}(x_a) \cdot \frac{\partial f_b}{\partial x_b^\top}(x_b)\)
    - where
      - \(x_b = f_a(x_a)\)
        
        cf.) Each coupling layer maps \(\mathbb{R}^D \rightarrow\mathbb{R}^D\); only \(s\) and \(t\) map \(\mathbb{R}^d \rightarrow\mathbb{R}^{D-d}\).
  - Since \(\text{det}(A\cdot B) = \text{det}(A)\,\text{det}(B)\), the log-determinants of the two layers add.
Inverse)
- \((f_b \circ f_a)^{-1} = f_a^{-1}\circ f_b^{-1}\)
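
A sketch of two coupling layers with complementary masks, assuming hypothetical toy functions `s` and `t` shared by both layers (real layers would each have their own networks). It illustrates that every component is updated after two layers, that the log-determinants add, and that the inverse applies the layer inverses in reverse order.

```python
import math

def s(v):
    """Hypothetical scale function on the masked input."""
    return [math.tanh(sum(v))] * len(v)

def t(v):
    """Hypothetical translation function on the masked input."""
    return [sum(v)] * len(v)

def make_layer(b):
    """Build (forward, inverse) for a masked affine coupling layer."""
    def forward(x):
        bx = [bi * xi for bi, xi in zip(b, x)]
        sv, tv = s(bx), t(bx)
        y = [bi * xi + (1 - bi) * (xi * math.exp(si) + ti)
             for bi, xi, si, ti in zip(b, x, sv, tv)]
        log_det = sum((1 - bi) * si for bi, si in zip(b, sv))
        return y, log_det

    def inverse(y):
        by = [bi * yi for bi, yi in zip(b, y)]
        sv, tv = s(by), t(by)
        return [bi * yi + (1 - bi) * (yi - ti) * math.exp(-si)
                for bi, yi, si, ti in zip(b, y, sv, tv)]

    return forward, inverse

f_a, f_a_inv = make_layer([1, 1, 0, 0])
f_b, f_b_inv = make_layer([0, 0, 1, 1])  # complementary mask

x = [0.3, -1.2, 0.7, 2.0]
h, log_det_a = f_a(x)
y, log_det_b = f_b(h)
total_log_det = log_det_a + log_det_b  # log-determinants add

x_rec = f_a_inv(f_b_inv(y))            # (f_b o f_a)^-1 = f_a^-1 o f_b^-1
```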

3.6 Multi-scale Architecture

Algorithm)
- Let
  - \(x \in \mathbb{R}^{s\times s}\) : the input data
  - \(L\) : the number of single-scale layers plus the final layer
    - i.e.)
      - the \(i=1,\cdots,L-1\)-th layers are the single-scale layers
      - the \(L\)-th layer is the final layer
- Main function
  - reshape \(x \in \mathbb{R}^{s\times s \times c}\), for \(c=1\)
  - \(i\leftarrow0,\quad h^{(i)}\leftarrow x\in \mathbb{R}^{s\times s \times c},\quad z=\Big[\underbrace{\text{null}\cdots\text{null}}_{L}\Big]\)
  - while \(s\gt 4\)
    - Apply the single-scale transformation \(f^{(i+1)}\) at layer \(i+1\).
      - \(f^{(i+1)}\left( h^{(i)} \right)\)
        
        Then by the squeezing operation, \(f^{(i+1)}\left( h^{(i)} \right) \in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 4c}\).
    - Halve the result along the channel axis into…
      - \(z^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[:2c]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\)
      - \(h^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[2c:]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\)
    - \(z[i+1] \leftarrow z^{(i+1)}\)
    - \(i \leftarrow i+1,\quad s \leftarrow s/2,\quad c \leftarrow 2c\)
  - Apply the final layer transformation of \(f^{(L)}\) on \(h^{(L-1)}\) as…
    - \(z^{(L)} \leftarrow f^{(L)}\left( h^{(L-1)} \right)\)
    - \(z[L] \leftarrow z^{(L)}\)
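
The bookkeeping of the loop above can be traced without any actual transformation. The sketch below only tracks tensor shapes, assuming a hypothetical helper `multiscale_shapes` and that each single-scale layer squeezes once and then factors out half of the channels.

```python
def multiscale_shapes(s, c=1, min_size=4):
    """Shapes of z^(1), ..., z^(L) for an s x s x c input."""
    z_shapes = []
    while s > min_size:
        s, c = s // 2, 4 * c        # squeezing inside f^(i+1)
        c = c // 2                  # split the channels in half
        z_shapes.append((s, s, c))  # z^(i+1); h^(i+1) has the same shape
    z_shapes.append((s, s, c))      # z^(L) from the final layer on h^(L-1)
    return z_shapes

shapes = multiscale_shapes(32)
total_dims = sum(a * b * ch for a, b, ch in shapes)
```

For a \(32\times32\times1\) input this gives three single-scale layers plus the final layer, and the factored-out latents together have as many dimensions as the input, as a bijection requires.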

Tech.) Single-Scale Architecture

Composition)
- 3 coupling layers of alternating checkerboard masks
- Perform a squeezing operation
- 3 coupling layers of alternating channel-wise masking
  - To avoid the redundancy from the previous checkerboard masking.

Tech.) Squeezing Operation

Desc.)
- Let \(A\in\mathbb{R}^{s\times s\times c}\) be the input data format
- Then the operation modifies the dimension by
  - \(s\times s\times c \rightarrow \frac{s}{2}\times\frac{s}{2}\times 4c\)
e.g.)
- \(4\times4\times1 \rightarrow 2\times 2\times 4\).
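
A minimal sketch of the squeezing operation on nested lists. The order in which the four pixels of each \(2\times2\) block are stacked along the channel axis is an assumption made here for illustration.

```python
def squeeze(a):
    """Reshape an s x s x c nested list into s/2 x s/2 x 4c (s must be even)."""
    size = len(a)
    out = []
    for i in range(0, size, 2):
        row = []
        for j in range(0, size, 2):
            # Stack the four pixels of the 2x2 block along the channel axis.
            row.append(a[i][j] + a[i][j + 1] + a[i + 1][j] + a[i + 1][j + 1])
        out.append(row)
    return out

# 4 x 4 x 1 input holding the values 1..16 in row-major order.
image = [[[4 * i + j + 1] for j in range(4)] for i in range(4)]
squeezed = squeeze(image)  # 2 x 2 x 4
```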

Tech.) Final-Scale Architecture

Composition)
- Final 4 coupling layers of alternating checkerboard masks

3.7 Batch Normalization

Motivation)
- To further improve the propagation of the training signal, \(s\) and \(t\) are deep residual networks with batch normalization.
Desc.)
- Apply batch normalization to the whole coupling layer output.
  - How?)
    - Let \(N\) be the size of the current mini-batch
    - Calculate the estimated batch statistics \(\tilde{\mu}\) and \(\tilde{\sigma}^2\)
      - e.g.) \(\displaystyle\tilde{\mu} = \frac{1}{N}\sum_{j=1}^N y^j\)
        
        where \(y^j = \left[ y_{1:d}^j\;\; y_{d+1:D}^j \right] = \left[ x_{1:d}^j\;\; x_{d+1:D}^j \odot \exp\left( s(x_{1:d}^j) \right) + t\left( x_{1:d}^j \right) \right]\)
    - Rescale the input as \(\displaystyle x \leftarrow \frac{x-\tilde{\mu}}{\sqrt{\tilde{\sigma}^2 + \epsilon}}\)
    - Then this rescaling function has a Jacobian determinant of
      - \(\displaystyle\left(\prod_{i} \left( \tilde{\sigma}_i^2 + \epsilon \right) \right)^{-\frac{1}{2}}\)
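
The rescaling and its log-determinant can be sketched as follows. This assumes the biased variance estimate (division by \(N\)) and treats the batch statistics as constants when forming the per-sample Jacobian; `batch_norm_forward` is a hypothetical helper name.

```python
import math

def batch_norm_forward(batch, eps=1e-5):
    """Normalize each dimension; return (normalized batch, log|det J|)."""
    n, dim = len(batch), len(batch[0])
    mu = [sum(row[i] for row in batch) / n for i in range(dim)]
    var = [sum((row[i] - mu[i]) ** 2 for row in batch) / n for i in range(dim)]
    normed = [[(row[i] - mu[i]) / math.sqrt(var[i] + eps) for i in range(dim)]
              for row in batch]
    # log of (prod_i (var_i + eps))^(-1/2)
    log_det = -0.5 * sum(math.log(v + eps) for v in var)
    return normed, log_det

batch = [[1.0, 10.0], [3.0, 14.0]]  # N = 2 samples, 2 dimensions
normed, log_det = batch_norm_forward(batch)
```

The log-determinant is added to the log-likelihood in the same way as the coupling layers' contributions.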
Moving Average)
- Why needed?)
  - Because of memory limits, the mini-batch size may have to be small.
  - Statistics computed from a small mini-batch are noisy, which makes training unstable.
  - Normalizing with statistics accumulated over the previous mini-batches, rather than the current one alone, stabilizes training.
- How?)
  - Let
    - \(\tilde{\mu}_t, \tilde{\sigma}_t^2\) be the layer statistics accumulated over mini-batches \(1,\ldots,t-1\)
    - \(\hat{\mu}_t, \hat{\sigma}_t^2\) be the current mini-batch statistics over the \(j=1,\ldots,N\) datapoints in the \(t\)-th mini-batch
  - Then the moving average statistics for the next step can be derived as
    - \(\tilde{\mu}_{t+1} = \rho \tilde{\mu}_t + (1-\rho)\hat{\mu}_t\)
    - \(\tilde{\sigma}_{t+1}^2 = \rho \tilde{\sigma}_t^2 + (1-\rho)\hat{\sigma}_t^2\)
    - where \(\rho\) is the momentum
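
The moving average update is a one-liner per statistic. The sketch below assumes a hypothetical helper `update_running_stats` and a momentum of 0.9 chosen only for illustration.

```python
def update_running_stats(running_mu, running_var, batch_mu, batch_var, rho=0.9):
    """Exponential moving average of the layer statistics."""
    new_mu = [rho * m + (1.0 - rho) * b for m, b in zip(running_mu, batch_mu)]
    new_var = [rho * v + (1.0 - rho) * b for v, b in zip(running_var, batch_var)]
    return new_mu, new_var

# One update step with accumulated stats (0, 1) and batch stats (1, 2).
mu_next, var_next = update_running_stats([0.0], [1.0], [1.0], [2.0])
```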

4 Experiments

4.1 Procedure

Tech.) Jittering Procedure

Objective)
- Image pixel values typically lie in \([0,256]^D\).
- This bounded support causes a boundary effect.
  - i.e.)
    - The model generates samples from a continuous distribution, which does not match a data distribution confined within these boundaries.
- To reconcile this issue, the authors add uniform noise to the pixel values.
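
A minimal sketch of jittering, assuming 8-bit integer pixels and noise drawn from \(U[0,1)\); the exact noise range is an assumption, not something stated in these notes.

```python
import math
import random

def jitter(pixels, rng):
    """Add independent U[0, 1) noise to each integer pixel value."""
    return [p + rng.random() for p in pixels]

rng = random.Random(0)
pixels = [0, 17, 255]
jittered = jitter(pixels, rng)
recovered = [math.floor(v) for v in jittered]  # flooring undoes the jitter
```

Under this assumption the jittered values stay within \([0,256)\) and the original integers can be recovered by flooring.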
