NICE - Non-Linear Independent Components Estimation

Hozy Summary

  • NF models require simple Jacobian computation.
  • Previous methods has limited expressiveness
    • e.g.)
      • Affine Transformation : Model’s structure itself is too simple
      • Triangular Matrices (\(M=LU\).) : Design choice is limited
  • Solution
    • Authors suggest a way of forming a single layer with…
      • Coupling Layer
        • This is one way of achieving the triangular matrices (lower matrix).
        • Jacobian determinant is always 1.
        • Great flexibility in design choice : \(m\).
          • e.g.) \(m\). can be set as a neural network.
      • Rescaling
        • Above coupling layer’s fixed Jacobian determinant of 1 means the model cannot contract the data.
        • Thus, we may utilize additional top layer \(S\). that contracts certain dimensions.



2 Learning Bijective Transformations of Continuous Probabilities

  • Settings)
    • \(\mathcal{D}\). : a finite dataset of…
      • \(N\). samples
      • in space \(\mathcal{X}\).
        • i.e.) \(\mathcal{D} \subseteq \mathbb{R}^D\).
    • \(\{p_\theta, \theta\in \Theta\}\). : a parametric family of densities
    • \(h=f(x)\). : a transformation
      • where
        • \(f:\mathcal{D}\rightarrow\mathbb{R}^D\).
          • i.e. \(x\in\mathcal{D}, h\in\mathbb{R}^D\).
  • Goal)
    • Ask learner (NN) to find a transformation \(f\).


Concept) Change of Variables

  • Desc.)
    • Recall that the Normalizing Flow models transform the data \(x\). to the easy-to-handle latent \(h\). through transformation.
      • e.g.) \(h = f(x)\). where \(h\sim\mathcal{N}(0,\mathbf{I})\).
    • Since the transformation \(f:\mathcal{D}\rightarrow\mathbb{R}^D\). is a vector field, we need to verify the probability distribution of \(h\). using the change of variables rule.
  • Settings)
    • Put \(h = (h_1, h_2, \ldots, h_D) \in\mathbb{R}^D\).
    • \(p_H(h)\). : the prior distribution
    • \(p_H(h) = \displaystyle\prod_{d=1}^D p_{H_d}(h_d)\).
      • i.e.) Each element of \(h\). are independent.
    • Choose $f$$. s.t.
      • invertible
      • \(f^{-1}\). is trivially obtained
      • \(\text{det}\frac{\partial f(x)}{\partial x}\). is trivially obtained.
  • Probability Distribution
    • \(\displaystyle p_X(x) = p_H(f(x))\left\vert \text{det}\frac{\partial f(x)}{\partial x} \right\vert\).
      • where \(\frac{\partial f(x)}{\partial x}\). is the Jacobian matrix of \(f\). at \(x\).
        • i.e.) \(\displaystyle \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_D(x)}{\partial x_1} & \frac{\partial f_D(x)}{\partial x_2} & \cdots & \frac{\partial f_D(x)}{\partial x_D} \\ \end{bmatrix}\).


Tech.) Learning a continuous, differentiable almost everywhere non-linear transformation

  • Formula
    • \(\log(p_X(x)) = \log(p_H(f(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)\).
      • where
        • \(p_H(h)\). : the prior distribution
          • e.g.) standard isotropic Gaussian
  • Factorial Prior Case
    • Notation)
      • \(f(x) = (f_1(x), f_2(x), \ldots, f_D(x)) = (f_d(x))_{d\le D}\).
    • Non-Linear Independent Components Estimation (NICE) criterion
      • \(\log(p_X(x)) = \displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)\).
  • Props.)
    • The determinant of Jacobian terms suppress the increase in likelihood due to the transformation.
      • Desc.)
        • The transformation \((f(x) = (f_d(x))_{d\le D})\). contracts the data and increases the likelihood arbitrarily.
          • Why)
            • The destination is the simple Gaussian of \(p_H(h_D)\)..
            • Thus, all data will be concentrated in that simple distribution.
            • Increased Data distribution Density \(\Rightarrow\). Increased Probability Density
        • However, the \(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). term counteracts the above by expanding regions of high density (i.e. data points!).
          • How?)
            • Recall that \(\vert\text{det}(\cdot)\vert\). calculates the hyper volume.
            • Also, Jacobian’s \((i,j)\).-th element is the partial differentiation.
              • \(\frac{\partial f(x_i)}{\partial x_j} \gt 1 \Leftrightarrow f(x_i)\text{ is expanding in the }x_j\text{ dim.}\).
              • \(\frac{\partial f(x_i)}{\partial x_j} \lt 1 \Leftrightarrow f(x_i)\text{ is contracting in the }x_j\text{ dim.}\).
            • Thus, if \(f(x) = (f_d(x))_{d\le D}\). contracts the data, \(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). will approach 0.
            • Hence, \(\log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). is negative, penalizing the increase in \(\displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x)))\).!

Concept) Autoencoder Framework

  • \(f\). : encoder
  • \(f^{-1}\). : decoder
  • Ancestral Sampling : \(H\rightarrow X\).



3 Architecture

3.1 Triangular Structure

  • Assumption)
    • We want to apply the NF of \(f = f_L\circ\cdots\circ f_1\). to better represent the data.
      • cf.)
        • Here, \(L\). denotes the number of layers.
        • \(D\). is the dimension in a single layer.
        • Thus, \(f_l : \mathbb{R}^D\rightarrow\mathbb{R}^D, \quad l=1,\ldots,L\).
    • Two critical conditions for choosing \(f\).
      • Computation is straightforward both forward(\(f\).) and backward(\(f^{-1}\).)
      • Tractability of Jacobian determinant
        • Why?) \(\ln q_L(\mathbf{z}_L) = \ln q_{0}(\mathbf{z}_{0}) + \sum_{l=1}^L \ln \left\vert\text{det}\frac{\partial f_l}{\partial \mathbf{z}_{l-1}}\right\vert\).
          • cf.) Refer to Rezende et al. for more details
          • cf.) The \(\ln \text{det}\). terms are added because we start from the data \(x\). to the latent \(z_L\)..
            • Rezende et al. starts from latent to data, so the \(\ln \text{det}\). terms are subtracted!
    • Previous Methods
      • Affine Transformation
        • Limit) The structure is so simple that the expressiveness of the model is limited.
      • Triangular Matrices
        • i.e.) \(M=LU\). where \(L\). is the lower, \(U\). is the upper triangular matrix.
        • Limit)
          • Limited design choices
          • Computationally efficient, but only represents linear transformations if \(M\). is a simple weight matrix.
  • Goal)
    • Obtain a family of bijections…
      • whose Jacobian determinant is tractable (lower triangular)
      • whose computation is straightforward both forward(\(f\).) and backward(\(f^{-1}\).)
        • where \(f=f_L\circ\cdots\circ f_1\).
    • Most of all, rich expressiveness of the model!
  • Suggestion)
    • Coupling Layer
      • Advantage)
        • Jacobian determinant is always 1!
          • This has the problem of incapability of contracting the data.
          • Authors resolve this issue using the rescaling \(S_{ii}\).
        • Choice of \(m\). is flexible!
          • We may utilize the expressiveness of the neural network here!


3.2 Coupling Layer

Concept) General Coupling Layer

  • Def.)
    • Let
      • \(x\in\mathcal{X}\).
      • \(I_1, I_2\). : the partitions of \([1, D]\). s.t.
        • \(d=\vert I_1 \vert\).
        • \(m : \mathbb{R}^d\rightarrow\mathbb{R}^d\). : the coupling function
    • Then, we may define \(y=(y_{I_1}, y_{I_2})\). as
      • \(y_{I_1} = x_{I_1}\).
      • \(y_{I_2} = g(x_{I_2};\;m(x_{I_1}))\).
        • where
          • \(g:\mathbb{R}^{D-d}\times m(\mathbb{R^d}) \rightarrow\mathbb{R}^{D-d}\). : the coupling law
        • Prop.)
          • Invertible map w.r.t. \(\mathbb{R}^{D-d}\). given \(m(\mathbb{R^d})\).
    • We may get the Jacobian as…
      • Put
        • \(I_1=[1,d]\).
        • \(I_2=[d+1,D]\).
      • Then, we may get the Jacobian by
        \(\begin{aligned} \text{det} \left(\frac{\partial y}{\partial x}\right) &= \text{det} \begin{bmatrix} \mathbf{I}_d & 0 \\ \frac{\partial y_{I_2}}{\partial x_{I_1}} & \frac{\partial y_{I_2}}{\partial x_{I_2}} \end{bmatrix} & \text{where } \mathbf{I}_d\in\mathbb{R}^d \text{ is the Identity Matrix} \\ &= \text{det} \left(\frac{\partial y_{I_2}}{\partial x_{I_2}}\right) \end{aligned}\).
    • Then, the inverse mapping goes
      • \(x_{I_1} = y_{I_1}\).
      • \(x_{I_2} = g^{-1}(y_{I_2};\;m(y_{I_1}))\).
  • e.g.)
    • Additive Coupling Layer
    • Multiplicative Coupling Layer
      • \(g(a;b) = a\odot b, \;b\ne0\).
    • Affine Coupling Layer
      • \(g(a;b) = a\odot b_1 + b_2, \;b\ne0\).
    • Combining Coupling Layers
      • At least three coupling layers are necessary to allow all dimensions to influence one another
      • Authors used 4
  • Problem)
    • The Jacobian determinant is always 1.
      • Why?)
        • Consider the Additive Coupling Layer below.
          • \(\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}\).
    • Recall that the NF model’s capability was contracting the data into the easy-to-handle distributed latent variable.
    • However, Jacobian determinant of 1 means the model cannot contract the data into the intended way.
    • Authors resolve this issue with rescaling.

Concept) Additive Coupling Layer

  • Desc.)
    • Simple implementation of General Coupling Layer using the additive coupling law of
      • \(g(a;\;b) = a+b\).
    • Then the mapping goes
      • \(y_{I_2} = x_{I_2} + m(x_{I_1})\).
      • \(x_{I_2} = y_{I_2} - m(y_{I_1})\).
  • Prop.)
    • Only as expensive as computing the transformation itself.
    • No restriction on choosing \(m\).
      • e.g.) a neural network with \(d\). input and \(D-d\). output
        • i.e.) \(y_{I_2} = x_{I_2} + \text{act}(y_{I_1}\mathbf{W} + \mathbf{b})\). where \(\text{act}\). is an activation function such as GeLU.
    • Unit Jacobian determinant
      • Why?) \(\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}\).


3.3 Allowing Rescaling

  • Goal)
    • Recall Coupling Layer’s incapability of contracting the data.
    • We want to artificially achieve this with the rescaling.
  • How?)
    • Let \(S\). be a diagonal matrix added at the top layer s.t.
      • cf.) \(S=[S_{ii}]_{i\le D}\). is a learnable parameter!
    • Then \(S_{ii}\). will be multiplied to the \(i\).-th dimension.
      • i.e.) \((x_i)_{i\le D} \rightarrow (S_{ii}x_i)_{i\le D}\).
    • Thus, the NICE criterion can be rewritten as
      • \(\log(p_X(x)) = \displaystyle\sum_{i=1}^D \left[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \right]\).
  • Prop.)
    • Larger the \(S_{ii}\). is, the less important the dimension \(i\). is.
    • The prior term encourages \(S_{ii}\). to be small, while the determinant term \(\log S_{ii}\). prevents \(S_{ii}\). from ever reaching 0.


3.4 Prior Distribution

  • Desc.)
    • Recall that we choose the prior to be factorial as…
      • \(p_H(h) = \prod_{i=1}^D p_{H_d}(h_d)\).
    • General choice
      • Gaussian
        • \(\log(p_{H_d}) = -\frac{1}{2}(h_d^2 + \log(2\pi))\).
      • Logistic Distribution
        • \(\log(p_{H_d}) = -\log(1+\exp(h_d)) - \log(1+\exp(-h_d))\).



4 Optimization (NEEDS VERIFICATION)

  • MLE maximization problem
    • \(\displaystyle\arg\max_{S, m, \theta} \log(p_X(x)) = \displaystyle\sum_{i=1}^D \bigg[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \bigg]\).



Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Variational Inference with Normalizing Flows
  • Density Estimation Using Real NVP
  • Flow Matching for Generative Modeling (CFM)
  • Variational Autoencoder Bayes (VAE)
  • (DM Reconst.) Ch.2 Variational Perspective - From VAEs to DDPM