NICE - Non-Linear Independent Components Estimation
Hozy Summary
- NF models require simple Jacobian computation.
- Previous methods has limited expressiveness
- e.g.)
- Affine Transformation : Model’s structure itself is too simple
- Triangular Matrices (\(M=LU\).) : Design choice is limited
- e.g.)
- Solution
- Authors suggest a way of forming a single layer with…
- Coupling Layer
- This is one way of achieving the triangular matrices (lower matrix).
- Jacobian determinant is always 1.
- Great flexibility in design choice : \(m\).
- e.g.) \(m\). can be set as a neural network.
- Rescaling
- Above coupling layer’s fixed Jacobian determinant of 1 means the model cannot contract the data.
- cf.) Refer to the explanation below for the contraction issue.
- Thus, we may utilize additional top layer \(S\). that contracts certain dimensions.
- Above coupling layer’s fixed Jacobian determinant of 1 means the model cannot contract the data.
- Coupling Layer
- Authors suggest a way of forming a single layer with…
2 Learning Bijective Transformations of Continuous Probabilities
- Settings)
- \(\mathcal{D}\). : a finite dataset of…
- \(N\). samples
- in space \(\mathcal{X}\).
- i.e.) \(\mathcal{D} \subseteq \mathbb{R}^D\).
- \(\{p_\theta, \theta\in \Theta\}\). : a parametric family of densities
- \(h=f(x)\). : a transformation
- where
- \(f:\mathcal{D}\rightarrow\mathbb{R}^D\).
- i.e. \(x\in\mathcal{D}, h\in\mathbb{R}^D\).
- \(f:\mathcal{D}\rightarrow\mathbb{R}^D\).
- where
- \(\mathcal{D}\). : a finite dataset of…
- Goal)
- Ask learner (NN) to find a transformation \(f\).
Concept) Change of Variables
- Desc.)
- Recall that the Normalizing Flow models transform the data \(x\). to the easy-to-handle latent \(h\). through transformation.
- e.g.) \(h = f(x)\). where \(h\sim\mathcal{N}(0,\mathbf{I})\).
- Since the transformation \(f:\mathcal{D}\rightarrow\mathbb{R}^D\). is a vector field, we need to verify the probability distribution of \(h\). using the change of variables rule.
- Recall that the Normalizing Flow models transform the data \(x\). to the easy-to-handle latent \(h\). through transformation.
- Settings)
- Put \(h = (h_1, h_2, \ldots, h_D) \in\mathbb{R}^D\).
- \(p_H(h)\). : the prior distribution
- e.g.) standard isotropic Gaussian
- cf.) Choice of the prior is explained in more detail below.
- \(p_H(h) = \displaystyle\prod_{d=1}^D p_{H_d}(h_d)\).
- i.e.) Each element of \(h\). are independent.
- Choose $f$$. s.t.
- invertible
- \(f^{-1}\). is trivially obtained
- \(\text{det}\frac{\partial f(x)}{\partial x}\). is trivially obtained.
- Probability Distribution
- \(\displaystyle p_X(x) = p_H(f(x))\left\vert \text{det}\frac{\partial f(x)}{\partial x} \right\vert\).
- where \(\frac{\partial f(x)}{\partial x}\). is the Jacobian matrix of \(f\). at \(x\).
- i.e.) \(\displaystyle \frac{\partial f(x)}{\partial x} = \begin{bmatrix} \frac{\partial f_1(x)}{\partial x_1} & \frac{\partial f_1(x)}{\partial x_2} & \cdots & \frac{\partial f_1(x)}{\partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_D(x)}{\partial x_1} & \frac{\partial f_D(x)}{\partial x_2} & \cdots & \frac{\partial f_D(x)}{\partial x_D} \\ \end{bmatrix}\).
- where \(\frac{\partial f(x)}{\partial x}\). is the Jacobian matrix of \(f\). at \(x\).
- \(\displaystyle p_X(x) = p_H(f(x))\left\vert \text{det}\frac{\partial f(x)}{\partial x} \right\vert\).
Tech.) Learning a continuous, differentiable almost everywhere non-linear transformation
- Formula
- \(\log(p_X(x)) = \log(p_H(f(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)\).
- where
- \(p_H(h)\). : the prior distribution
- e.g.) standard isotropic Gaussian
- \(p_H(h)\). : the prior distribution
- where
- \(\log(p_X(x)) = \log(p_H(f(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)\).
- Factorial Prior Case
- Notation)
- \(f(x) = (f_1(x), f_2(x), \ldots, f_D(x)) = (f_d(x))_{d\le D}\).
- Non-Linear Independent Components Estimation (NICE) criterion
- \(\log(p_X(x)) = \displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x))) + \log\left(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\right)\).
- Notation)
- Props.)
- The determinant of Jacobian terms suppress the increase in likelihood due to the transformation.
- Desc.)
- The transformation \((f(x) = (f_d(x))_{d\le D})\). contracts the data and increases the likelihood arbitrarily.
- Why)
- The destination is the simple Gaussian of \(p_H(h_D)\)..
- Thus, all data will be concentrated in that simple distribution.
- Increased Data distribution Density \(\Rightarrow\). Increased Probability Density
- Why)
- However, the \(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). term counteracts the above by expanding regions of high density (i.e. data points!).
- How?)
- Recall that \(\vert\text{det}(\cdot)\vert\). calculates the hyper volume.
- Also, Jacobian’s \((i,j)\).-th element is the partial differentiation.
- \(\frac{\partial f(x_i)}{\partial x_j} \gt 1 \Leftrightarrow f(x_i)\text{ is expanding in the }x_j\text{ dim.}\).
- \(\frac{\partial f(x_i)}{\partial x_j} \lt 1 \Leftrightarrow f(x_i)\text{ is contracting in the }x_j\text{ dim.}\).
- Thus, if \(f(x) = (f_d(x))_{d\le D}\). contracts the data, \(\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). will approach 0.
- Hence, \(\log\left\vert\text{det}\frac{\partial f(x)}{\partial x}\right\vert\). is negative, penalizing the increase in \(\displaystyle\sum_{d=1}^D \log(p_{H_d}(f_d(x)))\).!
- How?)
- The transformation \((f(x) = (f_d(x))_{d\le D})\). contracts the data and increases the likelihood arbitrarily.
- Desc.)
- The determinant of Jacobian terms suppress the increase in likelihood due to the transformation.
Concept) Autoencoder Framework
- \(f\). : encoder
- \(f^{-1}\). : decoder
- Ancestral Sampling : \(H\rightarrow X\).
3 Architecture
3.1 Triangular Structure
- Assumption)
- We want to apply the NF of \(f = f_L\circ\cdots\circ f_1\). to better represent the data.
- cf.)
- Here, \(L\). denotes the number of layers.
- \(D\). is the dimension in a single layer.
- Thus, \(f_l : \mathbb{R}^D\rightarrow\mathbb{R}^D, \quad l=1,\ldots,L\).
- cf.)
- Two critical conditions for choosing \(f\).
- Computation is straightforward both forward(\(f\).) and backward(\(f^{-1}\).)
- Tractability of Jacobian determinant
- Why?) \(\ln q_L(\mathbf{z}_L) = \ln q_{0}(\mathbf{z}_{0}) + \sum_{l=1}^L \ln \left\vert\text{det}\frac{\partial f_l}{\partial \mathbf{z}_{l-1}}\right\vert\).
- cf.) Refer to Rezende et al. for more details
- cf.) The \(\ln \text{det}\). terms are added because we start from the data \(x\). to the latent \(z_L\)..
- Rezende et al. starts from latent to data, so the \(\ln \text{det}\). terms are subtracted!
- Why?) \(\ln q_L(\mathbf{z}_L) = \ln q_{0}(\mathbf{z}_{0}) + \sum_{l=1}^L \ln \left\vert\text{det}\frac{\partial f_l}{\partial \mathbf{z}_{l-1}}\right\vert\).
- Previous Methods
- Affine Transformation
- Limit) The structure is so simple that the expressiveness of the model is limited.
- Triangular Matrices
- i.e.) \(M=LU\). where \(L\). is the lower, \(U\). is the upper triangular matrix.
- Limit)
- Limited design choices
- Computationally efficient, but only represents linear transformations if \(M\). is a simple weight matrix.
- Affine Transformation
- We want to apply the NF of \(f = f_L\circ\cdots\circ f_1\). to better represent the data.
- Goal)
- Obtain a family of bijections…
- whose Jacobian determinant is tractable (lower triangular)
- whose computation is straightforward both forward(\(f\).) and backward(\(f^{-1}\).)
- where \(f=f_L\circ\cdots\circ f_1\).
- Most of all, rich expressiveness of the model!
- Obtain a family of bijections…
- Suggestion)
- Coupling Layer
- Advantage)
- Jacobian determinant is always 1!
- This has the problem of incapability of contracting the data.
- Authors resolve this issue using the rescaling \(S_{ii}\).
- Choice of \(m\). is flexible!
- We may utilize the expressiveness of the neural network here!
- Jacobian determinant is always 1!
- Advantage)
- Coupling Layer
3.2 Coupling Layer
Concept) General Coupling Layer
- Def.)
- Let
- \(x\in\mathcal{X}\).
- \(I_1, I_2\). : the partitions of \([1, D]\). s.t.
- \(d=\vert I_1 \vert\).
- \(m : \mathbb{R}^d\rightarrow\mathbb{R}^d\). : the coupling function
- Then, we may define \(y=(y_{I_1}, y_{I_2})\). as
- \(y_{I_1} = x_{I_1}\).
- \(y_{I_2} = g(x_{I_2};\;m(x_{I_1}))\).
- where
- \(g:\mathbb{R}^{D-d}\times m(\mathbb{R^d}) \rightarrow\mathbb{R}^{D-d}\). : the coupling law
- Prop.)
- Invertible map w.r.t. \(\mathbb{R}^{D-d}\). given \(m(\mathbb{R^d})\).
- where
- We may get the Jacobian as…
- Put
- \(I_1=[1,d]\).
- \(I_2=[d+1,D]\).
- Then, we may get the Jacobian by
\(\begin{aligned} \text{det} \left(\frac{\partial y}{\partial x}\right) &= \text{det} \begin{bmatrix} \mathbf{I}_d & 0 \\ \frac{\partial y_{I_2}}{\partial x_{I_1}} & \frac{\partial y_{I_2}}{\partial x_{I_2}} \end{bmatrix} & \text{where } \mathbf{I}_d\in\mathbb{R}^d \text{ is the Identity Matrix} \\ &= \text{det} \left(\frac{\partial y_{I_2}}{\partial x_{I_2}}\right) \end{aligned}\).
- Put
- Then, the inverse mapping goes
- \(x_{I_1} = y_{I_1}\).
- \(x_{I_2} = g^{-1}(y_{I_2};\;m(y_{I_1}))\).
- Let
- e.g.)
- Additive Coupling Layer
- Multiplicative Coupling Layer
- \(g(a;b) = a\odot b, \;b\ne0\).
- Affine Coupling Layer
- \(g(a;b) = a\odot b_1 + b_2, \;b\ne0\).
- Combining Coupling Layers
- At least three coupling layers are necessary to allow all dimensions to influence one another
- Authors used 4
- Problem)
- The Jacobian determinant is always 1.
- Why?)
- Consider the Additive Coupling Layer below.
- \(\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}\).
- Consider the Additive Coupling Layer below.
- Why?)
- Recall that the NF model’s capability was contracting the data into the easy-to-handle distributed latent variable.
- However, Jacobian determinant of 1 means the model cannot contract the data into the intended way.
- Authors resolve this issue with rescaling.
- The Jacobian determinant is always 1.
Concept) Additive Coupling Layer
- Desc.)
- Simple implementation of General Coupling Layer using the additive coupling law of
- \(g(a;\;b) = a+b\).
- Then the mapping goes
- \(y_{I_2} = x_{I_2} + m(x_{I_1})\).
- \(x_{I_2} = y_{I_2} - m(y_{I_1})\).
- Simple implementation of General Coupling Layer using the additive coupling law of
- Prop.)
- Only as expensive as computing the transformation itself.
- No restriction on choosing \(m\).
- e.g.) a neural network with \(d\). input and \(D-d\). output
- i.e.) \(y_{I_2} = x_{I_2} + \text{act}(y_{I_1}\mathbf{W} + \mathbf{b})\). where \(\text{act}\). is an activation function such as GeLU.
- e.g.) a neural network with \(d\). input and \(D-d\). output
- Unit Jacobian determinant
- Why?) \(\displaystyle\frac{\partial y_{I_2}}{\partial x_{I_2}} = I_{D-d}\).
3.3 Allowing Rescaling
- Goal)
- Recall Coupling Layer’s incapability of contracting the data.
- We want to artificially achieve this with the rescaling.
- How?)
- Let \(S\). be a diagonal matrix added at the top layer s.t.
- cf.) \(S=[S_{ii}]_{i\le D}\). is a learnable parameter!
- Then \(S_{ii}\). will be multiplied to the \(i\).-th dimension.
- i.e.) \((x_i)_{i\le D} \rightarrow (S_{ii}x_i)_{i\le D}\).
- Thus, the NICE criterion can be rewritten as
- \(\log(p_X(x)) = \displaystyle\sum_{i=1}^D \left[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \right]\).
- Let \(S\). be a diagonal matrix added at the top layer s.t.
- Prop.)
- Larger the \(S_{ii}\). is, the less important the dimension \(i\). is.
- The prior term encourages \(S_{ii}\). to be small, while the determinant term \(\log S_{ii}\). prevents \(S_{ii}\). from ever reaching 0.
3.4 Prior Distribution
- Desc.)
- Recall that we choose the prior to be factorial as…
- \(p_H(h) = \prod_{i=1}^D p_{H_d}(h_d)\).
- General choice
- Gaussian
- \(\log(p_{H_d}) = -\frac{1}{2}(h_d^2 + \log(2\pi))\).
- Logistic Distribution
- \(\log(p_{H_d}) = -\log(1+\exp(h_d)) - \log(1+\exp(-h_d))\).
- Gaussian
- Recall that we choose the prior to be factorial as…
4 Optimization (NEEDS VERIFICATION)
- MLE maximization problem
- \(\displaystyle\arg\max_{S, m, \theta} \log(p_X(x)) = \displaystyle\sum_{i=1}^D \bigg[ \log(p_{H_i}(f_i(x))) + \log\left\vert S_{ii} \right\vert \bigg]\).
Enjoy Reading This Article?
Here are some more articles you might like to read next: