NICE - Non-Linear Independent Components Estimation

Hozy Summary

NF models require simple Jacobian computation.
Previous methods has limited expressiveness
- e.g.)
  - Affine Transformation : Model’s structure itself is too simple
  - Triangular Matrices ($M=LU$.) : Design choice is limited
Solution
- Authors suggest a way of forming a single layer with…
  - Coupling Layer
    - This is one way of achieving the triangular matrices (lower matrix).
    - Jacobian determinant is always 1.
    - Great flexibility in design choice : $m$.
      - e.g.) $m$. can be set as a neural network.
  - Rescaling
    - Above coupling layer’s fixed Jacobian determinant of 1 means the model cannot contract the data.
      - cf.) Refer to the explanation below for the contraction issue.
    - Thus, we may utilize additional top layer $S$. that contracts certain dimensions.

2 Learning Bijective Transformations of Continuous Probabilities

Settings)
- $\mathcal{D}$. : a finite dataset of…
  - $N$. samples
  - in space $\mathcal{X}$.
    - i.e.) $\mathcal{D} \subseteq \mathbb{R}^D$.
- $\{p_\theta, \theta\in \Theta\}$. : a parametric family of densities
- $h=f(x)$. : a transformation
  - where
    - $f:\mathcal{D}\rightarrow\mathbb{R}^D$.
      - i.e. $x\in\mathcal{D}, h\in\mathbb{R}^D$.
Goal)
- Ask learner (NN) to find a transformation $f$.

Concept) Change of Variables

Desc.)
- Recall that the Normalizing Flow models transform the data $x$. to the easy-to-handle latent $h$. through transformation.
  - e.g.) $h = f(x)$. where $h\sim\mathcal{N}(0,\mathbf{I})$.
- Since the transformation $f:\mathcal{D}\rightarrow\mathbb{R}^D$. is a vector field, we need to verify the probability distribution of $h$. using the change of variables rule.
Settings)
- Put $h = (h_1, h_2, \ldots, h_D) \in\mathbb{R}^D$.
- $p_H(h)$. : the prior distribution
  - e.g.) standard isotropic Gaussian
  - cf.) Choice of the prior is explained in more detail below.
- $p_H(h) = \displaystyle\prod_{d=1}^D p_{H_d}(h_d)$.
  - i.e.) Each element of $h$. are independent.
- Choose $f$$. s.t.
  - invertible
  - $f^{-1}$. is trivially obtained
  - $\text{det}\frac{\partial f(x)}{\partial x}$. is trivially obtained.
Probability Distribution
- $\displaystyle p_X(x) = p_H(f(x))\left\vert \text{det}\frac{\partial f(x)}{\partial x} \right\vert$.
  - where $\frac{\partial f(x)}{\partial x}$. is the Jacobian matrix of $f$. at $x$.
    - i.e.) $\displaystyle \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_D(x)}{\partial x_1} & \frac{\partial f_D(x)}{\partial x_2} & \cdots & \frac{\partial f_D(x)}{\partial x_D} \\ \end{bmatrix}$.

Tech.) Learning a continuous, differentiable almost everywhere non-linear transformation

Formula
- $\log(p_X(x)) = \log(p_H(f(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)$.
  - where
    - $p_H(h)$. : the prior distribution
      - e.g.) standard isotropic Gaussian
Factorial Prior Case
- Notation)
  - $f(x) = (f_1(x), f_2(x), \ldots, f_D(x)) = (f_d(x))_{d\le D}$.
- Non-Linear Independent Components Estimation (NICE) criterion
  - $\log(p_X(x)) = \displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)$.
Props.)
- The determinant of Jacobian terms suppress the increase in likelihood due to the transformation.
  - Desc.)
    - The transformation $(f(x) = (f_d(x))_{d\le D})$. contracts the data and increases the likelihood arbitrarily.
      - Why)
        
        The destination is the simple Gaussian of $p_H(h_D)$..
        
        Thus, all data will be concentrated in that simple distribution.
        
        Increased Data distribution Density $\Rightarrow$. Increased Probability Density
    - However, the $\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert$. term counteracts the above by expanding regions of high density (i.e. data points!).
      - How?)
        
        Recall that $\vert\text{det}(\cdot)\vert$. calculates the hyper volume.
        
        Also, Jacobian’s $(i,j)$.-th element is the partial differentiation.
        
        $\frac{\partial f(x_i)}{\partial x_j} \gt 1 \Leftrightarrow f(x_i)\text{ is expanding in the }x_j\text{ dim.}$.
        
        $\frac{\partial f(x_i)}{\partial x_j} \lt 1 \Leftrightarrow f(x_i)\text{ is contracting in the }x_j\text{ dim.}$.
        
        Thus, if $f(x) = (f_d(x))_{d\le D}$. contracts the data, $\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert$. will approach 0.
        
        Hence, $\log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert$. is negative, penalizing the increase in $\displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x)))$.!

Concept) Autoencoder Framework

$f$. : encoder
$f^{-1}$. : decoder
Ancestral Sampling : $H\rightarrow X$.

3 Architecture

3.1 Triangular Structure

Assumption)
- We want to apply the NF of $f = f_L\circ\cdots\circ f_1$. to better represent the data.
  - cf.)
    - Here, $L$. denotes the number of layers.
    - $D$. is the dimension in a single layer.
    - Thus, $f_l : \mathbb{R}^D\rightarrow\mathbb{R}^D, \quad l=1,\ldots,L$.
- Two critical conditions for choosing $f$.
  - Computation is straightforward both forward($f$.) and backward($f^{-1}$.)
  - Tractability of Jacobian determinant
    - Why?) $\ln q_L(\mathbf{z}_L) = \ln q_{0}(\mathbf{z}_{0}) + \sum_{l=1}^L \ln \left\vert\text{det}\frac{\partial f_l}{\partial \mathbf{z}_{l-1}}\right\vert$.
      - cf.) Refer to Rezende et al. for more details
      - cf.) The $\ln \text{det}$. terms are added because we start from the data $x$. to the latent $z_L$..
        
        Rezende et al. starts from latent to data, so the $\ln \text{det}$. terms are subtracted!
- Previous Methods
  - Affine Transformation
    - Limit) The structure is so simple that the expressiveness of the model is limited.
  - Triangular Matrices
    - i.e.) $M=LU$. where $L$. is the lower, $U$. is the upper triangular matrix.
    - Limit)
      - Limited design choices
      - Computationally efficient, but only represents linear transformations if $M$. is a simple weight matrix.
Goal)
- Obtain a family of bijections…
  - whose Jacobian determinant is tractable (lower triangular)
  - whose computation is straightforward both forward($f$.) and backward($f^{-1}$.)
    - where $f=f_L\circ\cdots\circ f_1$.
- Most of all, rich expressiveness of the model!
Suggestion)
- Coupling Layer
  - Advantage)
    - Jacobian determinant is always 1!
      - This has the problem of incapability of contracting the data.
      - Authors resolve this issue using the rescaling $S_{ii}$.
    - Choice of $m$. is flexible!
      - We may utilize the expressiveness of the neural network here!

3.2 Coupling Layer

Concept) General Coupling Layer

Def.)
- Let
  - $x\in\mathcal{X}$.
  - $I_1, I_2$. : the partitions of $[1, D]$. s.t.
    - $d=\vert I_1 \vert$.
    - $m : \mathbb{R}^d\rightarrow\mathbb{R}^d$. : the coupling function
- Then, we may define $y=(y_{I_1}, y_{I_2})$. as
  - $y_{I_1} = x_{I_1}$.
  - $y_{I_2} = g(x_{I_2};\;m(x_{I_1}))$.
    - where
      - $g:\mathbb{R}^{D-d}\times m(\mathbb{R^d}) \rightarrow\mathbb{R}^{D-d}$. : the coupling law
    - Prop.)
      - Invertible map w.r.t. $\mathbb{R}^{D-d}$. given $m(\mathbb{R^d})$.
- We may get the Jacobian as…
  - Put
    - $I_1=[1,d]$.
    - $I_2=[d+1,D]$.
  - Then, we may get the Jacobian by
    $\begin{aligned} \text{det} \left(\frac{\partial y}{\partial x}\right) &= \text{det} \begin{bmatrix} \mathbf{I}_d & 0 \\ \frac{\partial y_{I_2}}{\partial x_{I_1}} & \frac{\partial y_{I_2}}{\partial x_{I_2}} \end{bmatrix} & \text{where } \mathbf{I}_d\in\mathbb{R}^d \text{ is the Identity Matrix} \\ &= \text{det} \left(\frac{\partial y_{I_2}}{\partial x_{I_2}}\right) \end{aligned}$.
- Then, the inverse mapping goes
  - $x_{I_1} = y_{I_1}$.
  - $x_{I_2} = g^{-1}(y_{I_2};\;m(y_{I_1}))$.
e.g.)
- Additive Coupling Layer
- Multiplicative Coupling Layer
  - $g(a;b) = a\odot b, \;b\ne0$.
- Affine Coupling Layer
  - $g(a;b) = a\odot b_1 + b_2, \;b\ne0$.
- Combining Coupling Layers
  - At least three coupling layers are necessary to allow all dimensions to influence one another
  - Authors used 4
Problem)
- The Jacobian determinant is always 1.
  - Why?)
    - Consider the Additive Coupling Layer below.
      - $\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}$.
- Recall that the NF model’s capability was contracting the data into the easy-to-handle distributed latent variable.
- However, Jacobian determinant of 1 means the model cannot contract the data into the intended way.
- Authors resolve this issue with rescaling.

Concept) Additive Coupling Layer

Desc.)
- Simple implementation of General Coupling Layer using the additive coupling law of
  - $g(a;\;b) = a+b$.
- Then the mapping goes
  - $y_{I_2} = x_{I_2} + m(x_{I_1})$.
  - $x_{I_2} = y_{I_2} - m(y_{I_1})$.
Prop.)
- Only as expensive as computing the transformation itself.
- No restriction on choosing $m$.
  - e.g.) a neural network with $d$. input and $D-d$. output
    - i.e.) $y_{I_2} = x_{I_2} + \text{act}(y_{I_1}\mathbf{W} + \mathbf{b})$. where $\text{act}$. is an activation function such as GeLU.
- Unit Jacobian determinant
  - Why?) $\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}$.

3.3 Allowing Rescaling

Goal)
- Recall Coupling Layer’s incapability of contracting the data.
- We want to artificially achieve this with the rescaling.
How?)
- Let $S$. be a diagonal matrix added at the top layer s.t.
  - cf.) $S=[S_{ii}]_{i\le D}$. is a learnable parameter!
- Then $S_{ii}$. will be multiplied to the $i$.-th dimension.
  - i.e.) $(x_i)_{i\le D} \rightarrow (S_{ii}x_i)_{i\le D}$.
- Thus, the NICE criterion can be rewritten as
  - $\log(p_X(x)) = \displaystyle\sum_{i=1}^D \left[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \right]$.
Prop.)
- Larger the $S_{ii}$. is, the less important the dimension $i$. is.
- The prior term encourages $S_{ii}$. to be small, while the determinant term $\log S_{ii}$. prevents $S_{ii}$. from ever reaching 0.

3.4 Prior Distribution

Desc.)
- Recall that we choose the prior to be factorial as…
  - $p_H(h) = \prod_{i=1}^D p_{H_d}(h_d)$.
- General choice
  - Gaussian
    - $\log(p_{H_d}) = -\frac{1}{2}(h_d^2 + \log(2\pi))$.
  - Logistic Distribution
    - $\log(p_{H_d}) = -\log(1+\exp(h_d)) - \log(1+\exp(-h_d))$.

4 Optimization (NEEDS VERIFICATION)

MLE maximization problem
- $\displaystyle\arg\max_{S, m, \theta} \log(p_X(x)) = \displaystyle\sum_{i=1}^D \bigg[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \bigg]$.

Hozy Summary

2 Learning Bijective Transformations of Continuous Probabilities

Concept) Change of Variables

Tech.) Learning a continuous, differentiable almost everywhere non-linear transformation

Concept) Autoencoder Framework

3 Architecture

3.1 Triangular Structure

3.2 Coupling Layer

Concept) General Coupling Layer

Concept) Additive Coupling Layer

3.3 Allowing Rescaling

3.4 Prior Distribution

4 Optimization (NEEDS VERIFICATION)

Enjoy Reading This Article?