Density Estimation Using Real NVP

hozy Summary

  • NICE was limited in expressiveness because its additive coupling layers keep the Jacobian determinant at 1 (volume-preserving), so the model cannot flexibly contract or expand regions of the data distribution.
  • Real NVP removes this restriction and builds a multi-scale architecture by:
    • Introducing Affine coupling layers (scaling and translation) to construct a Non-Volume Preserving transformation.
      • The scaling function \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\). in each layer replaces the re-scaling from NICE
    • Performing a Squeezing operation at each scale, which trades spatial size for the number of channels.
    • Utilizing checkerboard and channel-wise masking patterns to fully exploit the local correlation structure of images and increase the model’s expressiveness.
    • Efficiently lowering the computational/memory cost by factoring out half of the variables directly to the latent space at regular intervals.



3 Model Definition

3.1 Change of variable formula

  • Settings)
    • \(x\in X\). : an observed data variable
    • \(z\in Z\). : a latent variable
      • where \(z\sim p_Z\).
    • \(f:X\rightarrow Z\). : a bijection
      • with \(g = f^{-1}\).
  • Formula)
    • \(p_X(x) = p_Z(f(x)) \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\).
    • \(\log p_X(x) = \log p_Z(f(x)) + \log \displaystyle \left\vert\text{det}\left(\frac{\partial f(x)}{\partial x^\top}\right)\right\vert\).
      • where \(\frac{\partial f(x)}{\partial x^\top}\). is the Jacobian of \(f\). at \(x\).
  • Sampling)
    • Draw \(z\sim p_Z\).
    • Map it back to the original (image) space: \(x = g(z)\).
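As a minimal sketch of the formula and the sampling procedure above, consider a one-dimensional case. The bijection `f(x) = (x - MU) / SIGMA` and the values of `MU` and `SIGMA` are assumptions chosen for illustration, not something taken from the paper; the prior is a standard normal.

```python
import math
import random

MU, SIGMA = 2.0, 3.0  # hypothetical parameters of the toy bijection


def log_p_z(z):
    """Log density of the prior p_Z = N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def f(x):
    """Bijection X -> Z (here: standardization)."""
    return (x - MU) / SIGMA


def g(z):
    """Inverse of f, mapping Z -> X."""
    return MU + SIGMA * z


def log_p_x(x):
    """log p_X(x) = log p_Z(f(x)) + log |det(df/dx)|."""
    log_abs_det = -math.log(SIGMA)  # df/dx = 1 / SIGMA
    return log_p_z(f(x)) + log_abs_det


# Sampling: draw z from the prior, then map it back to the data space.
z = random.gauss(0.0, 1.0)
x = g(z)
print(x, log_p_x(x))
```

With this choice of `f`, `log_p_x` coincides with the log density of a normal distribution with mean `MU` and standard deviation `SIGMA`, which gives a simple sanity check of the formula.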


3.2 Coupling Layers

Concept) Affine Coupling Layer

  • Settings)
    • \(d\lt D\).
    • \(x\in \mathbb{R}^D\). : the input
    • \(y\in \mathbb{R}^D\). : the output
  • Model)
    • \(y_{1:d} = x_{1:d}\).
    • \(y_{d+1:D} = x_{d+1:D} \odot \exp\left( s(x_{1:d}) \right) + t\left( x_{1:d} \right)\).
      • where
        • \(s : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\). : scale
        • \(t : \mathbb{R}^d\rightarrow\mathbb{R}^{D-d}\). : translation
        • \(\odot\). : the Hadamard product
  • Graphical Desc.)


3.3 Properties

Tech.) Jacobian Transformation

  • Formula)
    • \(\displaystyle\frac{\partial y}{\partial x^\top} = \begin{bmatrix} \mathbf{I}_d & \mathbf{0} \\ \frac{\partial y_{d+1:D}}{\partial x^\top_{1:d}} & \text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right) \end{bmatrix}\).
      • where
        • \(\text{diag}\left( \exp\left[ s(x_{1:d}) \right] \right)\). : a diagonal matrix whose diagonal elements are the entries of the vector \(\exp\left[ s(x_{1:d}) \right]\).
        • \(s(x_{1:d}) \in\mathbb{R}^{D-d}\).
  • Determinant)
    • \(\displaystyle\text{det}\left(\frac{\partial y}{\partial x^\top}\right) = \prod_{j=1}^{D-d} \exp\left[ s(x_{1:d})_j \right] = \exp\left[ \sum_{j=1}^{D-d} s(x_{1:d})_j \right]\).
      • where \(s(x_{1:d})_j\). is the \(j\).-th element of the vector \(s(x_{1:d}) \in\mathbb{R}^{D-d}\).
  • Invertibility)
    • Forward)
      \(\begin{cases} y_{1:d} &= x_{1:d} \\ y_{d+1:D} &= x_{d+1:D} \odot \exp \left( s(x_{1:d}) \right) + t(x_{1:d}) \\ \end{cases}\).
    • Backward)
      \(\begin{cases} x_{1:d} &= y_{1:d} \\ x_{d+1:D} &= (y_{d+1:D} - t(y_{1:d})) \odot \exp \left( -s(y_{1:d}) \right) \\ \end{cases}\).
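The forward pass, the inverse and the log-determinant of an affine coupling layer can be sketched as follows. The functions `s` and `t` here are hypothetical stand-ins (fixed random linear maps, with a `tanh` on the scale), not the networks used in the paper; the finite-difference Jacobian is included only to check the triangular structure and the determinant numerically.

```python
import numpy as np

D, d = 6, 2
rng = np.random.default_rng(0)

# Hypothetical stand-ins for the networks s and t (R^d -> R^(D-d)).
W_s = rng.normal(size=(D - d, d))
W_t = rng.normal(size=(D - d, d))


def s(x1):
    return np.tanh(W_s @ x1)


def t(x1):
    return W_t @ x1


def forward(x):
    """Return y and log|det(dy/dx)| = sum_j s(x_{1:d})_j."""
    x1, x2 = x[:d], x[d:]
    scale = s(x1)
    y2 = x2 * np.exp(scale) + t(x1)
    return np.concatenate([x1, y2]), scale.sum()


def inverse(y):
    """Invert the layer; s and t themselves are never inverted."""
    y1, y2 = y[:d], y[d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2])


def numerical_jacobian(fn, x, eps=1e-6):
    """Central finite-difference Jacobian, used only as a check."""
    jac = np.zeros((x.size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        jac[:, k] = (fn(x + step) - fn(x - step)) / (2.0 * eps)
    return jac


x = rng.normal(size=D)
y, log_det = forward(x)
jac = numerical_jacobian(lambda v: forward(v)[0], x)

print(np.allclose(inverse(y), x))                  # invertibility
print(np.allclose(jac[:d, d:], 0.0))               # upper-right block is zero
print(np.isclose(np.log(np.linalg.det(jac)), log_det, atol=1e-6))
```

Because the Jacobian is lower triangular, neither the determinant nor the inverse requires the Jacobian or the inverse of `s` and `t`, so both can be arbitrarily complex functions.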


3.4 Masked convolution

  • Formula)
    • \(y = b\odot x + (1-b) \odot \Big( x \odot \exp \big( s(b\odot x) \big) + t(b\odot x) \Big)\).
      • where
        • \(b = \Big[ \underbrace{1\;\cdots\;1}_{1\le j \le d} \; \underbrace{0\;\cdots\;0}_{d+1\le j \le D} \Big] \in\{0,1\}^D\). for \(j=1,\cdots,D\).
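A minimal sketch of the masked form, assuming `s` and `t` are hypothetical maps from R^D to R^D (fixed random linear maps here, standing in for the convolutional networks). With the binary mask `b` defined above, the masked formula reproduces the partitioned coupling layer: the first `d` components pass through unchanged and the rest are scaled and shifted.

```python
import numpy as np

D, d = 6, 2
rng = np.random.default_rng(0)

# Hypothetical stand-ins for s and t, both R^D -> R^D.
A_s = rng.normal(size=(D, D))
A_t = rng.normal(size=(D, D))


def s(v):
    return np.tanh(A_s @ v)


def t(v):
    return A_t @ v


# b = [1, ..., 1, 0, ..., 0] with d ones followed by D - d zeros.
b = np.concatenate([np.ones(d), np.zeros(D - d)])


def masked_coupling(x, b):
    """y = b*x + (1-b) * (x * exp(s(b*x)) + t(b*x))."""
    bx = b * x
    y = bx + (1.0 - b) * (x * np.exp(s(bx)) + t(bx))
    log_det = np.sum((1.0 - b) * s(bx))  # only unmasked entries contribute
    return y, log_det


x = rng.normal(size=D)
y, log_det = masked_coupling(x, b)
print(np.allclose(y[:d], x[:d]))
```

Since `s` and `t` only ever see `b * x`, they cannot depend on the components that are being transformed, which is what keeps the layer invertible.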

Tech.) Spatial Checkerboard Pattern

  • Desc.)
    • The spatial checkerboard pattern mask has value 1 where the sum of spatial coordinates is odd, and 0 otherwise.

Tech.) Channel-wise Masking

  • The channel-wise mask \(b\). is 1 for the first half of the channel dimensions and 0 for the second half.
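Both masks can be built in a few lines. This is a sketch assuming a channels-last layout `(H, W, C)`; the helper names `checkerboard_mask` and `channel_mask` are hypothetical. Note that the parity of the coordinate sum is the same for 0-based and 1-based indexing, so the indexing convention does not change the pattern.

```python
import numpy as np


def checkerboard_mask(height, width):
    """1 where (row + column) is odd, 0 otherwise."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    return ((rows + cols) % 2).astype(np.float64)


def channel_mask(channels):
    """1 for the first half of the channels, 0 for the second half."""
    mask = np.zeros(channels)
    mask[: channels // 2] = 1.0
    return mask


# Both masks broadcast against an image tensor of shape (H, W, C).
x = np.ones((4, 4, 2))
spatial = checkerboard_mask(4, 4)[:, :, None]   # shape (4, 4, 1)
channel = channel_mask(2)[None, None, :]        # shape (1, 1, 2)
print((spatial * x)[:, :, 0])
print((channel * x)[0, 0])
```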


3.5 Combining Coupling Layers

  • Why needed?)
    • As the forward transformation in 3.3 shows, a single coupling layer leaves \(x_{1:d}\). unchanged.
    • Thus, coupling layers are composed in an alternating pattern, so that the components left unchanged in one layer are updated in the next.
  • How?)
    • Let
      • \(a = \{1,\cdots,d\}\).
      • \(b = \{d+1,\cdots,D\}\).
    • Then we may compose two coupling layers of \(f_a\). and \(f_b\). as…
      • \(\displaystyle\frac{\partial(f_b\circ f_a)}{\partial x_a^\top}(x_a) = \frac{\partial f_a}{\partial x_a^\top}(x_a) \cdot \frac{\partial f_b}{\partial x_b^\top}(x_b)\).
        • where
          • \(x_b = f_a(x_a)\).
            • cf.) Recall that each coupling layer is a map \(f:\mathbb{R}^D \rightarrow\mathbb{R}^{D}\).
  • Inverse)
    • \((f_b \circ f_a)^{-1} = f_a^{-1}\circ f_b^{-1}\).
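The alternating composition can be sketched with two masked coupling layers that use complementary masks. The factory `make_coupling` and its linear/`tanh` stand-ins for `s` and `t` are assumptions for illustration. The log-determinants add up because det(A·B) = det(A)·det(B), and the inverse applies the layer inverses in reverse order.

```python
import numpy as np

D = 4


def make_coupling(mask, seed):
    """Masked affine coupling with hypothetical linear/tanh s and t."""
    r = np.random.default_rng(seed)
    A_s, A_t = r.normal(size=(D, D)), r.normal(size=(D, D))

    def forward(x):
        mx = mask * x
        scale = (1.0 - mask) * np.tanh(A_s @ mx)
        y = mx + (1.0 - mask) * (x * np.exp(scale) + A_t @ mx)
        return y, scale.sum()

    def inverse(y):
        my = mask * y
        scale = (1.0 - mask) * np.tanh(A_s @ my)
        return my + (1.0 - mask) * (y - A_t @ my) * np.exp(-scale)

    return forward, inverse


b = np.array([1.0, 1.0, 0.0, 0.0])
f_a, f_a_inv = make_coupling(b, seed=1)        # updates the last two components
f_b, f_b_inv = make_coupling(1.0 - b, seed=2)  # updates the first two components

x = np.random.default_rng(0).normal(size=D)
h, log_det_a = f_a(x)
y, log_det_b = f_b(h)
total_log_det = log_det_a + log_det_b  # log-determinants of the layers add up

x_rec = f_a_inv(f_b_inv(y))  # (f_b o f_a)^{-1} = f_a^{-1} o f_b^{-1}
print(np.allclose(x_rec, x))
```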


3.6 Multi-scale Architecture

  • Algorithm)
    • Let \(L\). be the number of scales (the length of \(z\).)
    • Main function
      • reshape \(x \in \mathbb{R}^{s\times s \times c}\)., for \(c=1\).
      • \(i\leftarrow0,\quad h^{(i)}\leftarrow x\in \mathbb{R}^{s\times s \times c},\quad z=\Big[\underbrace{\text{null}\cdots\text{null}}_{L}\Big]\).
      • while \(s\gt 4\).
        • Apply the single-layer transformation \(f^{(i+1)}\). at layer \(i+1\).
          • \(f^{(i+1)}\left( h^{(i)} \right)\).
            • Then by the squeezing operation, \(f^{(i+1)}\left( h^{(i)} \right) \in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 4c}\).
        • Halve the result into…
          • \(z^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[:2c]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\).
          • \(h^{(i+1)} \leftarrow f^{(i+1)}\left( h^{(i)} \right)[2c:]\in \mathbb{R}^{\frac{s}{2}\times\frac{s}{2}\times 2c}\).
        • \(z[i+1] \leftarrow z^{(i+1)}\).
        • \(i \leftarrow i+1\).
        • \(s \leftarrow \frac{s}{2},\quad c \leftarrow 2c\). (the shape of the new \(h^{(i)}\).)
      • Apply the final layer transformation \(f^{(L)}\). to \(h^{(L-1)}\). as…
        • \(z^{(L)} \leftarrow f^{(L)}\left( h^{(L-1)} \right)\).
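The control flow of the algorithm above can be sketched as follows. The functions `scale_transform` and `final_transform` are placeholders (assumptions): in the real model they contain coupling layers, whereas here `scale_transform` keeps only the squeezing step and `final_transform` is the identity, so that the bookkeeping of shapes is visible. A channels-last layout `(s, s, c)` is assumed.

```python
import numpy as np


def squeeze(h):
    """(s, s, c) -> (s/2, s/2, 4c): move each 2x2 spatial block into channels."""
    s, _, c = h.shape
    h = h.reshape(s // 2, 2, s // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return h.reshape(s // 2, s // 2, 4 * c)


def scale_transform(h):
    """Placeholder for f^(i+1): only the squeeze is kept here."""
    return squeeze(h)


def final_transform(h):
    """Placeholder for f^(L): identity instead of the final coupling layers."""
    return h


def multi_scale_forward(x):
    h, z = x, []
    while h.shape[0] > 4:
        out = scale_transform(h)
        half = out.shape[-1] // 2
        z.append(out[..., :half])  # factored out to the latent space
        h = out[..., half:]        # processed further at the next scale
    z.append(final_transform(h))
    return z


x = np.arange(32 * 32, dtype=np.float64).reshape(32, 32, 1)
z = multi_scale_forward(x)
print([zi.shape for zi in z])
print(sum(zi.size for zi in z) == x.size)
```

Each pass halves the spatial size and sends half of the channels directly to the latent output, so the total number of latent variables equals the number of input variables.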


Tech.) Single-Scale Architecture

Tech.) Squeezing Operation

  • Desc.)
    • Let \(A\in\mathbb{R}^{s\times s\times c}\). be the input data format
    • Then the operation modifies the dimension by
      • \(s\times s\times c \rightarrow \frac{s}{2}\times\frac{s}{2}\times 4c\).
  • e.g.)
    • \(4\times4\times1 \rightarrow 2\times 2\times 4\).
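A sketch of the squeezing operation and its inverse on the 4×4×1 example, assuming a channels-last layout and that each 2×2 spatial block is flattened row by row into the channel axis (the exact channel ordering is an assumption; any fixed ordering gives a valid bijection).

```python
import numpy as np


def squeeze(a):
    """(s, s, c) -> (s/2, s/2, 4c)."""
    s, _, c = a.shape
    a = a.reshape(s // 2, 2, s // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return a.reshape(s // 2, s // 2, 4 * c)


def unsqueeze(a):
    """Inverse of squeeze: (s, s, 4c) -> (2s, 2s, c)."""
    s, _, c4 = a.shape
    c = c4 // 4
    a = a.reshape(s, s, 2, 2, c).transpose(0, 2, 1, 3, 4)
    return a.reshape(2 * s, 2 * s, c)


A = np.arange(16).reshape(4, 4, 1)
B = squeeze(A)
print(B.shape)        # spatial size halves, channel count quadruples
print(B[0, 0])        # the top-left 2x2 block of A, now along the channel axis
print(np.array_equal(unsqueeze(B), A))
```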

Tech.) Final-Scale Architecture


3.7 Batch Normalization

  • Motivation)
    • To improve the propagation of the training signal, the authors use deep residual networks with batch normalization and weight normalization for \(s\). and \(t\).
  • Desc.)
    • Apply batch normalization to the whole coupling layer output.
      • How?)
        • Let \(N\). be the size of the current mini-batch
        • Calculate the estimated batch statistics \(\tilde{\mu}\). and \(\tilde{\sigma}^2\).
          • e.g.) \(\displaystyle\tilde{\mu} = \frac{1}{N}\sum_{j=1}^N y^j\).
            • where \(y^j = \left[ y_{1:d}^j\;\; y_{d+1:D}^j \right] = \left[ x_{1:d}^j\;\; x_{d+1:D}^j \odot \exp\left( s(x_{1:d}^j) \right) + t\left( x_{1:d}^j \right) \right]\).
        • Rescale the input as \(\displaystyle x \leftarrow \frac{x-\tilde{\mu}}{\sqrt{\tilde{\sigma}^2 + \epsilon}}\).
        • Then this rescaling function has a Jacobian determinant of
          • \(\displaystyle\left(\prod_{i} \left( \tilde{\sigma}_i^2 + \epsilon \right) \right)^{-\frac{1}{2}}\).
  • Moving Average)
    • Why needed?)
      • Due to memory constraints, the mini-batches may have to be small.
      • Statistics estimated from small mini-batches are noisy, which makes training unstable.
      • Normalizing the current latent with statistics accumulated over the previous mini-batches stabilizes mini-batch training.
    • How?)
      • Let
        • \(\tilde{\mu}_t, \tilde{\sigma}_t^2\). be the layer statistics accumulated over \(1,\ldots,t-1\). mini-batches.
        • \(\hat{\mu}_t, \hat{\sigma}_t^2\). be the current mini-batch statistics over \(j=1,\ldots,N\). datapoints in the \(t\).-th mini-batch
      • Then the moving-average statistics for the next mini-batch can be derived as
        • \(\tilde{\mu}_{t+1} = \rho \tilde{\mu}_t + (1-\rho)\hat{\mu}_t\).
        • \(\tilde{\sigma}_{t+1}^2 = \rho \tilde{\sigma}_t^2 + (1-\rho)\hat{\sigma}_t^2\).
        • where \(\rho\). is the momentum.
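The rescaling, its log-determinant and the moving-average update can be sketched together. Several choices here are assumptions for illustration: the batch is normalized with its own statistics, the running statistics start at zero mean and unit variance, and `rho = 0.99` and `eps = 1e-5` are arbitrary values. The statistics are treated as constants when writing the per-datapoint Jacobian.

```python
import numpy as np


def batch_norm(y, eps=1e-5):
    """Normalize a mini-batch y of shape (N, D) with its own statistics."""
    mu_hat = y.mean(axis=0)
    var_hat = y.var(axis=0)
    x = (y - mu_hat) / np.sqrt(var_hat + eps)
    # log of (prod_i (var_i + eps))^(-1/2), the same for every data point
    log_det = -0.5 * np.sum(np.log(var_hat + eps))
    return x, log_det, mu_hat, var_hat


def update_running(running, current, rho=0.99):
    """Moving average: rho * running + (1 - rho) * current."""
    return rho * running + (1.0 - rho) * current


rng = np.random.default_rng(0)
D = 3
running_mu, running_var = np.zeros(D), np.ones(D)  # assumed initial values

for step in range(5):
    y = rng.normal(loc=1.0, scale=2.0, size=(8, D))  # stand-in layer output
    x, log_det, mu_hat, var_hat = batch_norm(y)
    running_mu = update_running(running_mu, mu_hat)
    running_var = update_running(running_var, var_hat)

print(log_det, running_mu, running_var)
```

The term `log_det` is added to the log-likelihood in the same way as the log-determinant of a coupling layer.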


4 Experiments

4.1 Procedure

Tech.) Jittering Procedure

  • Objective)
    • Image pixel values typically lie in \([0,256]^D\).
    • This results in a boundary effect.
      • i.e.)
        • The model defines a density on an unbounded continuous space, which does not match the bounded support of the data distribution.
    • To address this issue, the authors add uniform noise to the data.
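A minimal sketch of the jittering step, assuming 8-bit integer pixel values and noise drawn from U[0, 1) independently per pixel. The helper name `jitter` and these specifics are assumptions for illustration, not details confirmed by the notes above.

```python
import numpy as np


def jitter(pixels, rng):
    """Add U[0, 1) noise to integer pixel values (assumed 8-bit)."""
    noise = rng.uniform(0.0, 1.0, size=pixels.shape)
    return pixels.astype(np.float64) + noise


rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)
x = jitter(pixels, rng)

print(x.min() >= 0.0, x.max() <= 256.0)
print(np.array_equal(np.floor(x), pixels))
```

The jittered values are continuous and stay within the interval from 0 to 256, and flooring them recovers the original integers.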


