Variational Inference with Normalizing Flows
hozy Summary
- Start from the ELBO Loss.
- \(\log p_\theta(\mathbf{x})\ge-\mathbb{D}_{\text{KL}}\left[q_\phi(\mathbf{z}\mid\mathbf{x}) \Vert p(\mathbf{z}) \right] + \mathbb{E}_{q}\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) \right]=-\mathcal{F}(\mathbf{x})\).
- Simplify the loss with Normalizing Flow and LOTUS
- \(\begin{aligned} \mathbb{E}_q\left[ \log \frac{p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] &= \mathbb{E}_q\left[ \underbrace{\log p_\theta(\mathbf{x},\mathbf{z})}_{h_1} - \underbrace{\log q_\phi(\mathbf{z}\mid\mathbf{x})}_{h_2} \right] \\ &= \mathbb{E}_{q_0}[\log p_\theta(\mathbf{x},f(\mathbf{z}_0))] - \mathbb{E}_q\left[ \log q_\phi(\mathbf{z}\mid\mathbf{x}) \right] \end{aligned}\).
- Using Planar Flow or Radial Flow, we may further simplify the objective.
2 Amortized Variational Inference
- Settings)
- \(\mathbf{x}\). : observations
- \(\mathbf{z}\). : latent variables
- \(\theta\). : model parameters
Concept) Approximated Posterior Distribution for the latent
- Def.)
- \(q_\phi(\mathbf{z}\mid\mathbf{x})\).
- Derivation)
\(\begin{aligned} \log p_\theta(\mathbf{x}) &= \log\int p_\theta(\mathbf{x}\mid\mathbf{z}) p(\mathbf{z}) \text{d}\mathbf{z} \\ &= \log\int\frac{q_\phi(\mathbf{z}\mid\mathbf{x})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \; p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z}) \text{d}\mathbf{z} \\ &= \log\int\frac{p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \; q_\phi(\mathbf{z}\mid\mathbf{x}) \text{d}\mathbf{z} \\ &= \log \mathbb{E}_q\left[ \frac{p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] \\ &\ge \mathbb{E}_q\left[ \log \frac{p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] & (\because \text{Jensen's inequality}) \\ &= \mathbb{E}_q\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) + \log \frac{ \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] \\ &= \mathbb{E}_q\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) \right] - \int \log \frac{q_\phi(\mathbf{z}\mid\mathbf{x})}{p(\mathbf{z})} q_\phi(\mathbf{z}\mid\mathbf{x})\text{d}\mathbf{z} \\ &= -\mathbb{D}_{\text{KL}}\left[q_\phi(\mathbf{z}\mid\mathbf{x}) \Vert p(\mathbf{z}) \right] + \mathbb{E}_{q}\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) \right] \\ &= -\mathcal{F}(\mathbf{x}) \end{aligned}\).
- Other names
- Negative Free Energy \(-\mathcal{F}(\mathbf{x})\).
- ELBO
- Prop)
- Consists of two terms
- \(-\mathbb{D}_{\text{KL}}\left[q_\phi(\mathbf{z}\mid\mathbf{x}) \Vert p(\mathbf{z}) \right]\).
- KL-divergence between the approximated posterior \(q_\phi(\mathbf{z}\mid\mathbf{x})\) and the prior distribution \(p(\mathbf{z})\).
- \(\mathbb{E}_{q}\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) \right]\).
- Expected log-likelihood of the data, i.e. the negative reconstruction error
- Provides the unified objective function for \(\theta\). and \(\phi\).
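The bound above can be checked numerically on a toy model. The sketch below assumes (purely for illustration, not from the paper) a one-dimensional model with prior \(p(z)=\mathcal{N}(0,1)\) and likelihood \(p(x\mid z)=\mathcal{N}(z,1)\), for which both ELBO terms and \(\log p(x)\) have closed forms. The function names are hypothetical.

```python
import math

def elbo_gaussian(x, m, s):
    """Closed-form ELBO for the assumed toy model with q(z|x) = N(m, s^2)."""
    # E_q[log p(x|z)] for p(x|z) = N(z, 1)
    expected_loglik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
    # KL[N(m, s^2) || N(0, 1)]
    kl = 0.5 * (m ** 2 + s ** 2 - 1.0 - math.log(s ** 2))
    return expected_loglik - kl

def log_evidence(x):
    """log p(x) for the toy model, where marginally p(x) = N(0, 2)."""
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0

x = 1.3
loose = elbo_gaussian(x, m=0.0, s=1.0)                  # q equal to the prior
tight = elbo_gaussian(x, m=x / 2.0, s=math.sqrt(0.5))   # q equal to the exact posterior
print(loose, tight, log_evidence(x))
```

In this toy case the bound is tight exactly when \(q\) matches the true posterior \(\mathcal{N}(x/2, 1/2)\), which is the motivation for richer posterior families.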
Tech.) Variational Inference
- How?)
- Use…
- the ELBO loss
- mini-batch strategy
- stochastic gradient descent
- Limit)
- Computational cost of calculating \(\nabla_\phi \mathbb{E}_{q}\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) \right]\).
- Choosing the richest posterior family \(q\) that is still computationally feasible.
- This paper tackles this problem!
- Practices)
- Stochastic Backpropagation
- Kingma et al. 2014, Stochastic Gradient Variational Bayes
- Two steps)
- Reparameterization
- \(q_\phi(z) = \mathcal{N}(z\mid \mu,\sigma^2)\), sampled as \(z = \mu + \sigma\epsilon\) with \(\epsilon\sim\mathcal{N}(0,1)\).
- Backpropagation with Monte Carlo
- \(\nabla_\phi \mathbb{E}_{q}\left[ f_\theta(z) \right] \Leftrightarrow \mathbb{E}_{\mathcal{N}(\epsilon\mid0,1)}\left[ \nabla_\phi f_\theta(\mu+\sigma\epsilon) \right]\).
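A minimal sketch of the two steps above, assuming the test function \(f(z)=z^2\) (chosen only because \(\mathbb{E}_q[z^2]=\mu^2+\sigma^2\) gives exact gradients \(2\mu\) and \(2\sigma\) to compare against). The function name and sample count are illustrative assumptions.

```python
import random

def reparam_grads(mu, sigma, n_samples=100_000, seed=0):
    """Monte Carlo gradients of E_q[f(z)] w.r.t. (mu, sigma) for f(z) = z^2."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # eps ~ N(0, 1)
        z = mu + sigma * eps           # reparameterized sample from N(mu, sigma^2)
        df_dz = 2.0 * z                # derivative of the assumed f(z) = z^2
        g_mu += df_dz                  # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps         # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

g_mu, g_sigma = reparam_grads(mu=1.0, sigma=0.5)
print(g_mu, g_sigma)   # compare with the exact values 2*mu and 2*sigma
```

The randomness is moved into \(\epsilon\), so the gradient passes through the deterministic map \(\mu+\sigma\epsilon\) and can be averaged over samples.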
- Inference Networks
- Goal)
- Learn an inverse map from observations to latent variables
- Advantage)
- No need to compute per-datapoint variational parameters
- Instead, compute global variational parameters \(\phi\). valid for both training and test time
- Amortizing the cost of inference by generalizing between the posterior estimates for all latent variables through parameters of the inference network
- e.g.)
- Diagonal Gaussian Densities
- \(q_\phi(\mathbf{z}\mid\mathbf{x}) = \mathcal{N}(\mathbf{z}\mid\mu_\phi(\mathbf{x}), \text{diag}(\sigma^2_\phi(\mathbf{x})))\).
- where \(\mu_\phi, \sigma^2_\phi\) are specified using deep neural networks
- Deep Latent Gaussian Models (DLGM)
- Desc.)
- Deep directed graphical model with the \(L\). layers of Gaussian latent variables \(\mathbf{z}_l\).
- where \(l=1,2,\ldots, L\).
- Each layer of latent variables is dependent on the layer above in a non-linear way
- Joint Probability
- \(p(\mathbf{x}, \mathbf{z}_1,\ldots,\mathbf{z}_L) = p(\mathbf{x}\mid f_0(\mathbf{z}_1)) \displaystyle\prod_{l=1}^L p(\mathbf{z}_l \mid f_l(\mathbf{z}_{l+1}))\).
- where
- \(p(\mathbf{z}_L) = \mathcal{N}(\mathbf{0, I})\). : the prior over latent variables
- \(p_\theta(\mathbf{x}\mid\mathbf{z})\). : the observation likelihood as any appropriate distribution
- parameterized by a deep neural network \(\theta\).
3 Normalizing Flows
- Goal)
- We want to find the latent posterior \(q_\phi\). s.t.
- \(q_\phi(\mathbf{z}\mid\mathbf{x}) \approx p_\theta(\mathbf{z}\mid\mathbf{x}) \Leftrightarrow \mathbb{D}_{\text{KL}}(q\Vert p)\approx 0\).
- Def.)
- Normalizing Flow
- A sequence of invertible mappings through which an initial simple density "flows", yielding a valid, highly flexible probability distribution.
3.1 Finite Flows
- Settings)
- \(f:\mathbb{R}^d\rightarrow\mathbb{R}^d\). : an invertible smooth mapping s.t.
- \(f^{-1} = g\)., i.e. \(g\circ f(\mathbf{z}) = \mathbf{z}\).
- Prop.)
- For a random variable \(\mathbf{z}\) with distribution \(q\), and \(\mathbf{z}' = f(\mathbf{z})\),
- \(q_{\text{new}}(\mathbf{z}') = q_{\text{old}}(\mathbf{z})\displaystyle\left\vert\text{det}\frac{\partial f^{-1}}{\partial \mathbf{z}'}\right\vert = q_{\text{old}}(\mathbf{z})\left\vert\text{det}\frac{\partial f}{\partial \mathbf{z}}\right\vert^{-1}\).
- \(q_K(\mathbf{z})\) : the density obtained by successively transforming \(\mathbf{z}_0\) through a chain of \(K\) transformations \(f_k\), \(k=1,\ldots,K\)
- \(\mathbf{z}_K = f_K\circ\cdots\circ f_1(\mathbf{z}_0)\).
- Notation)
\(\begin{aligned} \mathbf{z}_K &= f_K(f_{K-1}(\cdots f_2(f_1(\mathbf{z_0})))) = f_K(f_{K-1}(\cdots f_2(\mathbf{z_1}))) & (\mathbf{z_1} = f_1(\mathbf{z_0})) \\ &\quad\vdots \\ &= f_K(f_{K-1}(\mathbf{z}_{K-2})) = f_K(\mathbf{z}_{K-1}) & (\mathbf{z}_{K-1} = f_{K-1}(\mathbf{z}_{K-2})) \\ \end{aligned}\).
- \(\ln q_K(\mathbf{z}_K) = \ln q_0 (\mathbf{z}_0) - \displaystyle\sum_{k=1}^K\ln\left\vert\text{det}\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right\vert \quad\cdots\quad(A)\).
- Derivation)
\(\begin{aligned} \ln q_K(\mathbf{z}_K) &= \ln q_K(f_K(\mathbf{z}_{K-1})) \\ &= \ln \left( q_{K-1}(\mathbf{z}_{K-1}) \left\vert\text{det}\frac{\partial f_K}{\partial \mathbf{z}_{K-1}}\right\vert^{-1} \right) = \ln q_{K-1}(\mathbf{z}_{K-1}) - \ln \left\vert\text{det}\frac{\partial f_K}{\partial \mathbf{z}_{K-1}}\right\vert \\ &= \ln q_{K-2}(\mathbf{z}_{K-2}) - \ln \left\vert\text{det}\frac{\partial f_{K-1}}{\partial \mathbf{z}_{K-2}}\right\vert - \ln \left\vert\text{det}\frac{\partial f_K}{\partial \mathbf{z}_{K-1}}\right\vert = \cdots \\ &= \ln q_{0}(\mathbf{z}_{0}) - \sum_{k=1}^K \ln \left\vert\text{det}\frac{\partial f_k}{\partial \mathbf{z}_{k-1}}\right\vert \\ \end{aligned}\).
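Equation \((A)\) can be sanity-checked with a sketch that assumes one-dimensional affine maps \(f_k(z)=a_k z+b_k\) (an illustrative choice, not a flow from the paper). Their composition is again affine, so \(q_K\) is a Gaussian whose log-density is known in closed form. All names below are hypothetical.

```python
import math

def std_normal_logpdf(z):
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def normal_logpdf(z, mean, std):
    return -math.log(std) - 0.5 * math.log(2 * math.pi) - 0.5 * ((z - mean) / std) ** 2

def flow_log_density(z0, flows):
    """Push z0 through f_k(z) = a*z + b and accumulate ln q_K(z_K) via (A)."""
    z = z0
    log_q = std_normal_logpdf(z0)      # ln q_0(z_0) with q_0 = N(0, 1)
    for a, b in flows:
        z = a * z + b
        log_q -= math.log(abs(a))      # ln|det df_k/dz_{k-1}| = ln|a| in 1-D
    return z, log_q

flows = [(2.0, 1.0), (-0.5, 3.0), (1.5, -2.0)]
z0 = 0.7
z_k, log_q = flow_log_density(z0, flows)

# Closed form: the composition is z_K = scale * z_0 + shift, so q_K = N(shift, scale^2).
scale, shift = 1.0, 0.0
for a, b in flows:
    scale, shift = a * scale, a * shift + b
reference = normal_logpdf(z_k, shift, abs(scale))
print(log_q, reference)
```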
Concept) The Law of the Unconscious Statistician (LOTUS)
- Thm.)
- \(\mathbb{E}_{q_K} \left[ h(\mathbf{z}_K) \right] = \mathbb{E}_{q_0} \left[ h(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) \right]\).
- Why?)
\(\begin{aligned} \mathbb{E}_{q_K} \left[ h(\mathbf{z}_K) \right] &= \int h(\mathbf{z}_K) \; q_K(\mathbf{z}_K) \text{d} \mathbf{z}_K \\ &= \int h(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) \; q_K(\mathbf{z}_K) \text{d} \mathbf{z}_K & (\because \mathbf{z}_K \triangleq f_K\circ\cdots\circ f_1(\mathbf{z}_0)) \\ &= \int h(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) \underbrace{\left( q_0 (\mathbf{z}_0) \left\vert\text{det}\frac{\partial f}{\partial \mathbf{z}_0}\right\vert^{-1} \right)}_{=q_K(\mathbf{z}_K)} \underbrace{\left( \left\vert\text{det}\frac{\partial f}{\partial \mathbf{z}_0}\right\vert \text{d} \mathbf{z}_0 \right)}_{=\text{d} \mathbf{z}_K} & (\text{Put } f = f_K\circ\cdots\circ f_1) &\quad (\text{ cf. } \frac{\partial f}{\partial \mathbf{z}_0} = \frac{\partial f_K}{\partial \mathbf{z}_{K-1}}\frac{\partial f_{K-1}}{\partial \mathbf{z}_{K-2}}\cdots\frac{\partial f_1}{\partial \mathbf{z}_{0}} ) \\ &= \int h(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) q_0 (\mathbf{z}_0) \text{d} \mathbf{z}_0 \\ &= \mathbb{E}_{q_0} \left[ h(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) \right] \end{aligned}\).
- Meaning)
- Expectations w.r.t. the transformed \(q_K\). can be computed without explicitly knowing \(q_K\)..
- i.e.) If \(h(\mathbf{z})\) does not depend on \(q_K\), computing \(\mathbb{E}_{q_K}[h]\) does not require the Jacobian terms!
- Application)
- Recall from the ELBO loss that
\(\begin{aligned} \mathbb{E}_q\left[ \log \frac{p_\theta(\mathbf{x}\mid\mathbf{z}) \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] &= \mathbb{E}_q\left[ \underbrace{\log p_\theta(\mathbf{x},\mathbf{z})}_{h_1} - \underbrace{\log q_\phi(\mathbf{z}\mid\mathbf{x})}_{h_2} \right] \end{aligned}\).
- \(h_1\) can be simplified using LOTUS as…
\(\begin{aligned} \mathbb{E}_{q_K}[\log p_\theta(\mathbf{x},\mathbf{z}_K)] &= \int \log p_\theta(\mathbf{x},\mathbf{z}_K) \; q_K(\mathbf{z}_K) \text{d} \mathbf{z}_K \\ &= \int \log p_\theta(\mathbf{x},f(\mathbf{z}_0)) \; q_K(\mathbf{z}_K) \text{d} \mathbf{z}_K & (\because \mathbf{z}_K = f(\mathbf{z}_0),\; f = f_K\circ\cdots\circ f_1) \\ &= \int \log p_\theta(\mathbf{x},f(\mathbf{z}_0)) \; q_0(\mathbf{z}_0) \text{d} \mathbf{z}_0 & (\because \text{LOTUS}) \\ &= \mathbb{E}_{q_0}[\log p_\theta(\mathbf{x},f(\mathbf{z}_0))] \\ \end{aligned}\).
- cf.) No need to calculate the Jacobian determinants.
- \(h_2\). needs Jacobian determinant calculation.
- why?)
- \(h_2=\log q_\phi(\mathbf{z}\mid\mathbf{x})\) is the log-density of \(q_K\) itself, so evaluating it requires the logdet-Jacobian sum in \((A)\).
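LOTUS can be illustrated with a small Monte Carlo sketch: sample \(\mathbf{z}_0\sim q_0\), push it through the flow, and average \(h\), without ever evaluating \(q_K\) or a Jacobian. The affine maps and \(h(z)=z^2\) below are assumptions picked so the exact answer is known (\(z_K = -z_0 + 2.5\), hence \(\mathbb{E}[z_K^2] = 1 + 2.5^2 = 7.25\)).

```python
import random

def push_forward(z0, flows):
    """Apply f_K o ... o f_1 for assumed affine maps f_k(z) = a*z + b."""
    z = z0
    for a, b in flows:
        z = a * z + b
    return z

def lotus_expectation(h, flows, n_samples=100_000, seed=0):
    """Estimate E_{q_K}[h(z_K)] as an average over z_0 ~ q_0 = N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z0 = rng.gauss(0.0, 1.0)
        total += h(push_forward(z0, flows))   # no density or Jacobian needed
    return total / n_samples

flows = [(2.0, 1.0), (-0.5, 3.0)]
estimate = lotus_expectation(lambda z: z * z, flows)
print(estimate)   # compare with the exact value 7.25
```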
3.2 Infinitesimal Flows
- Desc.)
- The length of the normalizing flow tends to infinity.
- Still finite!
- Utilize a partial differential equation to describe how \(q_0(\mathbf{z}_0)\) evolves over time \(t\)
- i.e.) \(\displaystyle\frac{\partial}{\partial t} q_t(\mathbf{z}_t) = \mathcal{T}_t[q_t(\mathbf{z}_t)]\).
- e.g.)
- Langevin Flow
- Def.)
- Langevin SDE
- \(\text{d}\mathbf{z}(t) = \mathbf{F}(\mathbf{z}(t), t)\text{d}t + \mathbf{G}(\mathbf{z}(t),t)\text{d}\xi(t)\).
- where
- \(\text{d}\xi(t)\). : a Wiener process with
- \(\mathbb{E}[\xi(t)] = 0\).
- \(\mathbb{E}[\xi_i(t)\xi_j(t')] = \delta_{i,j}\delta(t-t')\).
- with
- the Kronecker delta \(\delta_{i,j} = \begin{cases} 1&\text{if } i=j\\ 0&\text{otherwise} \end{cases}\).
- the Dirac delta \(\delta(t-t') = \begin{cases} \infty&\text{if } t=t'\\ 0&\text{otherwise} \end{cases}\).
- \(\mathbf{F}\). : the drift vector
- \(\mathbf{D} = \mathbf{GG}^\top\). : the diffusion matrix
- Putting \(q_t(\mathbf{z})\) to be the probability distribution of \(\mathbf{z}\) at time \(t\), we may get the Fokker-Planck equation as
- \(\displaystyle\frac{\partial}{\partial t} q_t(\mathbf{z}) = -\sum_i \frac{\partial}{\partial z_i}[F_i(\mathbf{z}, t)q_t] + \frac{1}{2}\sum_{i,j}\frac{\partial^2}{\partial z_i \partial z_j} [D_{i,j}(\mathbf{z}, t)q_t]\).
- Usage)
- In ML, we use
- \(F(\mathbf{z},t) = -\nabla_z\mathcal{L}(\mathbf{z})\).
- where \(\mathcal{L}(\mathbf{z})\). is the unnormalized log-density of the model
- \(G(\mathbf{z},t) = \sqrt{2}\delta_{i,j}\).
- Sol.)
- With this choice, the stationary solution for \(q_t(\mathbf{z})\) as \(t\rightarrow\infty\) is the Boltzmann distribution, i.e., \(q_\infty(\mathbf{z})\varpropto e^{-\mathcal{L}(\mathbf{z})}\).
- Hamiltonian Flow
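The Langevin flow with \(F=-\nabla_z\mathcal{L}\) and \(G=\sqrt{2}\) can be simulated with an Euler-Maruyama discretization. This is a rough sketch under the assumption \(\mathcal{L}(z)=z^2/2\) in one dimension, so that \(e^{-\mathcal{L}}\) is proportional to a standard normal. The step size, particle count and function name are illustrative choices, and the discretization introduces a small bias in the variance.

```python
import math
import random

def langevin_samples(grad_L, z_init=3.0, n_particles=2000, n_steps=200, dt=0.05, seed=0):
    """Euler-Maruyama simulation of dz = -grad_L(z) dt + sqrt(2) d(xi)."""
    rng = random.Random(seed)
    zs = [z_init] * n_particles            # every particle starts far from the mode
    noise_scale = math.sqrt(2.0 * dt)      # std of the Wiener increment times sqrt(2)
    for _ in range(n_steps):
        zs = [z - grad_L(z) * dt + noise_scale * rng.gauss(0.0, 1.0) for z in zs]
    return zs

# Assumed energy L(z) = z^2 / 2, so grad_L(z) = z and the target is N(0, 1).
zs = langevin_samples(lambda z: z)
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
print(mean, var)   # should be close to 0 and 1 respectively
```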
4 Inference with Normalizing Flows
4.1 Invertible Linear-time Transformations
Concept) Planar Flows
- Settings)
- A neural network with…
- \(L\). : the number of hidden layers
- \(D\). : the hidden dimension of the hidden layers
- cf.) Invertible neural networks take \(O(LD^3)\). time to calculate the Jacobians
- A family of transformations s.t.
- \(f(\mathbf{z}) = \mathbf{z} + \mathbf{u}h(\mathbf{w}^\top\mathbf{z} + b)\).
- where
- \(\lambda = \{\mathbf{w, u}\in\mathbb{R}^D, b\in\mathbb{R}\}\). : free parameters
- \(h(\cdot)\). : a smooth element-wise non-linearity with derivative \(h'(\cdot)\).
- Jacobian Computation Tricks
- Using \(\psi(\mathbf{z}) = h'(\mathbf{w^\top z}+b)\mathbf{w}\), we may get the logdet-Jacobian term as
- \(\displaystyle\left\vert\text{det}\frac{\partial f}{\partial\mathbf{z}}\right\vert = \left\vert\text{det}(\mathbf{I} + \mathbf{u\psi(z)}^\top)\right\vert = \left\vert 1 + \mathbf{u^\top\psi(z)} \right\vert\).
- Thus, we may get
- \(\ln q_K(\mathbf{z}_K) = \ln q_0(\mathbf{z}_0) - \displaystyle\sum_{k=1}^K\ln\left\vert 1 + \mathbf{u}_k^\top\psi_{k}(\mathbf{z}_{k-1}) \right\vert\quad\cdots\quad(A)\).
- where \(\mathbf{z}_K = f_K\circ\cdots\circ f_1(\mathbf{z}_0)\).
- Desc.)
- \((A)\). modifies the initial density \(q_0\). by applying a series of contractions and expansions in the direction perpendicular to the hyperplane \(\mathbf{w^\top z}+b=0\).
- \(O(D)\) : linear-time logdet-Jacobian computation
- Bimodal distribution
- Invertible
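A minimal sketch of one planar flow step, assuming \(h=\tanh\) and \(D=2\) so the \(O(D)\) formula \(\ln\vert 1+\mathbf{u}^\top\psi(\mathbf{z})\vert\) can be compared against a finite-difference Jacobian. The parameter values and helper names are arbitrary illustrations.

```python
import math

def planar_flow(z, u, w, b):
    """Return f(z) = z + u*tanh(w.z + b) and ln|det df/dz| = ln|1 + u.psi(z)|."""
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    h_prime = 1.0 - h * h                               # derivative of tanh
    f = [zi + ui * h for zi, ui in zip(z, u)]
    u_dot_psi = h_prime * sum(ui * wi for ui, wi in zip(u, w))
    return f, math.log(abs(1.0 + u_dot_psi))

def numeric_logdet_2d(func, z, eps=1e-6):
    """ln|det J| of a 2-D map via central finite differences (check only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = func(zp), func(zm)
        cols.append([(p - m) / (2 * eps) for p, m in zip(fp, fm)])  # cols[j][i] = df_i/dz_j
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z, u, w, b = [0.4, -1.2], [0.8, -0.3], [1.0, 0.5], 0.1
_, analytic = planar_flow(z, u, w, b)
numeric = numeric_logdet_2d(lambda v: planar_flow(v, u, w, b)[0], z)
print(analytic, numeric)
```

With \(h=\tanh\), keeping \(\mathbf{w}^\top\mathbf{u}\ge -1\) is a sufficient condition for the map to stay invertible; the values above satisfy it.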
Concept) Radial Flows
- Settings)
- A family of transformations s.t.
- \(f(\mathbf{z}) = \mathbf{z}+\beta h(\alpha, r)(\mathbf{z}-\mathbf{z}_0)\).
- where
- \(\mathbf{z}_0\). : a reference point
- \(r = \vert \mathbf{z} - \mathbf{z}_0 \vert\).
- \(h(\alpha, r) = \displaystyle\frac{1}{\alpha+r}\).
- \(\lambda = \{\mathbf{z_0}\in\mathbb{R}^D, \alpha\in\mathbb{R}^+, \beta\in\mathbb{R}\}\). : free parameters
- Jacobian Computation Tricks
- \(\displaystyle\left\vert\text{det}\frac{\partial f}{\partial\mathbf{z}}\right\vert = \left[ 1+\beta h(\alpha, r) \right]^{D-1} \; \left[ 1+\beta h(\alpha, r) + \beta h'(\alpha, r)r \right]\).
- Desc.)
- Radial contractions and expansions around the reference point
- \(O(D)\) : linear-time logdet-Jacobian computation
- Bimodal distribution
- Invertible
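The radial flow determinant can be checked the same way. The sketch below assumes \(D=2\), uses \(h(\alpha,r)=1/(\alpha+r)\) with \(h'(\alpha,r)=-1/(\alpha+r)^2\), and compares the closed-form log-determinant with a finite-difference Jacobian. Parameter values and helper names are illustrative assumptions.

```python
import math

def radial_flow(z, z_ref, alpha, beta):
    """Return f(z) = z + beta*h(alpha, r)*(z - z_ref) and ln|det df/dz|."""
    diff = [zi - ci for zi, ci in zip(z, z_ref)]
    r = math.sqrt(sum(d * d for d in diff))
    h = 1.0 / (alpha + r)
    h_prime = -1.0 / (alpha + r) ** 2               # dh/dr
    f = [zi + beta * h * di for zi, di in zip(z, diff)]
    dim = len(z)
    logdet = (dim - 1) * math.log(abs(1.0 + beta * h)) \
        + math.log(abs(1.0 + beta * h + beta * h_prime * r))
    return f, logdet

def numeric_logdet_2d(func, z, eps=1e-6):
    """ln|det J| of a 2-D map via central finite differences (check only)."""
    cols = []
    for j in range(2):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        fp, fm = func(zp), func(zm)
        cols.append([(p - m) / (2 * eps) for p, m in zip(fp, fm)])  # cols[j][i] = df_i/dz_j
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    return math.log(abs(det))

z, z_ref, alpha, beta = [1.0, 0.7], [0.2, -0.1], 1.0, 0.5
_, analytic = radial_flow(z, z_ref, alpha, beta)
numeric = numeric_logdet_2d(lambda v: radial_flow(v, z_ref, alpha, beta)[0], z)
print(analytic, numeric)
```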
4.2 Flow-Based Free Energy Bound
- Goal)
- Specify the optimization target \(\mathcal{F}(\mathbf{x})\). : the free energy
- Settings)
- \(q_\phi(\mathbf{z\mid x}) := q_K(\mathbf{z}_K)\). : an approximated posterior distribution with the flow of length \(K\).
- Derivation)
- Recall the ELBO loss of
- \(\mathbb{E}_q\left[ \log p_\theta(\mathbf{x}\mid\mathbf{z}) + \log \frac{ \; p(\mathbf{z})}{q_\phi(\mathbf{z}\mid\mathbf{x})} \right] = \mathbb{E}_q\left[ \log p_\theta(\mathbf{x},\mathbf{z}) - \log q_\phi(\mathbf{z}\mid\mathbf{x}) \right] = -\mathcal{F}(\mathbf{x})\).
- cf.) LOTUS
- Plugging in the Planar Flows, we may get
\(\begin{aligned} \mathcal{F}(\mathbf{x}) &= \mathbb{E}_{q_\phi(\mathbf{z}\mid\mathbf{x})}\left[ \log q_\phi(\mathbf{z}\mid\mathbf{x}) - \log p_\theta(\mathbf{x},\mathbf{z}) \right] \\ &= \mathbb{E}_{q_K(\mathbf{z}_K)}\left[ \log q_K(\mathbf{z}_K) \right] - \mathbb{E}_{q_0(\mathbf{z}_0)}\left[ \log p_\theta(\mathbf{x},\mathbf{z}_K) \right] & (\because\text{By def. of } \mathbf{z}_K \text{ and LOTUS} ) \\ &= \mathbb{E}_{q_0(\mathbf{z}_0)}\left[ \ln q_0(\mathbf{z}_0) - \displaystyle\sum_{k=1}^K\ln\left\vert 1 + {\mathbf{u}_k^\top\psi_k(\mathbf{z}_{k-1})} \right\vert \right] - \mathbb{E}_{q_0(\mathbf{z}_0)}\left[ \log p_\theta(\mathbf{x},\mathbf{z}_K) \right] & (\because\text{Planar Flow}) \\ \end{aligned}\).
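The flow-based bound can be estimated by Monte Carlo over \(\mathbf{z}_0\sim q_0\) alone. The sketch below is a toy instance under stated assumptions: one latent dimension, \(q_0=\mathcal{N}(0,1)\), planar maps \(f(z)=z+u\tanh(wz+b)\), and the illustrative model \(p(z)=\mathcal{N}(0,1)\), \(p(x\mid z)=\mathcal{N}(z,1)\). In the paper the flow parameters come from an inference network; here they are fixed by hand.

```python
import math
import random

LOG_2PI = math.log(2 * math.pi)

def log_normal(v, mean=0.0, var=1.0):
    return -0.5 * (LOG_2PI + math.log(var) + (v - mean) ** 2 / var)

def free_energy_estimate(x, flows, n_samples=50_000, seed=0):
    """Monte Carlo estimate of F(x) for 1-D planar flows given as (u, w, b) triples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                 # z_0 ~ q_0
        log_q = log_normal(z)                   # ln q_0(z_0)
        for u, w, b in flows:
            h = math.tanh(w * z + b)
            log_q -= math.log(abs(1.0 + u * w * (1.0 - h * h)))  # uses psi_k(z_{k-1})
            z = z + u * h                       # z_k = f_k(z_{k-1})
        log_joint = log_normal(x, mean=z) + log_normal(z)        # ln p(x|z_K) + ln p(z_K)
        total += log_q - log_joint
    return total / n_samples

x = 1.0
f_identity = free_energy_estimate(x, [])                 # K = 0, so q_K = q_0
f_flow = free_energy_estimate(x, [(0.5, 1.0, 0.0)])      # one planar step
neg_log_evidence = -log_normal(x, mean=0.0, var=2.0)     # exact -ln p(x) for the toy model
print(f_identity, f_flow, neg_log_evidence)
```

Since \(\mathcal{F}(\mathbf{x}) = -\ln p(\mathbf{x}) + \mathbb{D}_{\text{KL}}[q_K \Vert p(\mathbf{z}\mid\mathbf{x})]\), both estimates should sit above the exact negative log-evidence up to Monte Carlo noise.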
- Optimization)
- \(\displaystyle\arg\max_{\phi,\theta} \text{ELBO} = \displaystyle\arg\min_{\phi,\theta} \mathcal{F}(\mathbf{x})\).
- Analysis)
- We may rewrite as
\(\begin{aligned} -\mathcal{F}(\mathbf{x}) &= \mathbb{E}_{q_{\phi}(\mathbf{z\mid x})}\left[ \ln p_\theta(\mathbf{x},\mathbf{z}) - \ln q_\phi(\mathbf{z}\mid\mathbf{x}) \right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z\mid x})}\left[ \ln p_\theta(\mathbf{x},\mathbf{z}) \right] - \mathbb{E}_{q_{\phi}(\mathbf{z\mid x})}\left[ \ln q_\phi(\mathbf{z}\mid\mathbf{x}) \right] \\ &= \mathbb{E}_{q_{\phi}(\mathbf{z\mid x})}\left[ -\underbrace{\mathcal{L}(\mathbf{z},\mathbf{x})}_{\text{Energy}} \right] \underbrace{- \mathbb{E}_{q_{\phi}(\mathbf{z\mid x})}\left[ \ln q_\phi(\mathbf{z}\mid\mathbf{x}) \right]}_{\text{Entropy}} \\ \end{aligned}\).
- cf.)
- \(F = E - TS\) : Refer to Free energy note
- why?)
- Energy (E) : converging to a certain point
- Recall that \(p(\mathbf{z}\mid\mathbf{x}) = \displaystyle\frac{e^{-\mathcal{L}(\mathbf{z}, \mathbf{x})}}{Z}\).
- where
- \(Z=\displaystyle\int e^{-\mathcal{L}(\mathbf{z}, \mathbf{x})}\text{d}\mathbf{z}\).
- \(\mathcal{L}(\mathbf{z}, \mathbf{x})\). is the energy of the latent \(\mathbf{z}\)., jointly distributed with the data \(\mathbf{x}\).
- i.e.) Higher chance (probability) that \(\mathbf{z}\). is at the low energy state.
- Putting \(p(\mathbf{z}, \mathbf{x}) \varpropto e^{-\mathcal{L}(\mathbf{z}, \mathbf{x})}\)., we may rewrite as
- \(\mathcal{L}(\mathbf{z}, \mathbf{x}) = -\ln p(\mathbf{z}, \mathbf{x})\).
- i.e.) Energy = - log likelihood
- cf.) why \(p(\mathbf{z}, \mathbf{x})\).?
- The goal of ML is to learn \(p(\mathbf{z}\mid\mathbf{x}) = \displaystyle\frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\varpropto p(\mathbf{z}, \mathbf{x})\).
- Entropy (S) : dispersing to chaos
- By definition the entropy of the approximated posterior \(q_\phi\) is \(-\displaystyle\int q_\phi(\mathbf{z}\mid\mathbf{x}) \ln q_\phi(\mathbf{z}\mid\mathbf{x}) \; \text{d}\mathbf{z}\).
- Hence, the ELBO maximization problem is equivalent to…
- \(\mathcal{F}(\mathbf{x})\). minimization
- Energy minimization
- Entropy maximization
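The energy and entropy reading of the bound can be verified in closed form on a toy case. The sketch assumes (for illustration only) \(p(z)=\mathcal{N}(0,1)\), \(p(x\mid z)=\mathcal{N}(z,1)\) and \(q(z\mid x)=\mathcal{N}(m,s^2)\), and checks that \(\mathbb{E}_q[-\mathcal{L}]\) plus the entropy of \(q\) equals the KL form of the ELBO. Function names are hypothetical.

```python
import math

LOG_2PI = math.log(2 * math.pi)

def energy_and_entropy(x, m, s):
    """Return (E_q[-L(z, x)], H[q]) with L = -ln p(x, z) for the assumed toy model."""
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)   # E_q[ln p(x|z)]
    expected_log_prior = -0.5 * LOG_2PI - 0.5 * (m ** 2 + s ** 2)       # E_q[ln p(z)]
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)             # entropy of N(m, s^2)
    return expected_log_lik + expected_log_prior, entropy

def elbo_kl_form(x, m, s):
    """ELBO written as E_q[ln p(x|z)] - KL[q || p(z)]."""
    expected_log_lik = -0.5 * LOG_2PI - 0.5 * ((x - m) ** 2 + s ** 2)
    kl = 0.5 * (m ** 2 + s ** 2 - 1.0 - math.log(s ** 2))
    return expected_log_lik - kl

x, m, s = 1.3, 0.4, 0.8
neg_energy, entropy = energy_and_entropy(x, m, s)
print(neg_energy + entropy, elbo_kl_form(x, m, s))
```

A narrower \(q\) concentrated near low-energy \(\mathbf{z}\) raises the first term but lowers the entropy, which is the trade-off the free energy balances.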
4.3 Algorithm Summary and Complexity
- Algorithm)
- Inference Time Complexity
- \(O(LN^2 + KD)\).
- where
- \(L\). : the number of deterministic layers used to map the data to the parameters of the flow
- cf.) the encoder depth that maintains the dimension \(D\).
- \(N\). : the average hidden layer size
- \(K\). : the flow-length
- \(D\). : the dimension of the latent variables
5 Alternative Flow-based Posteriors
Concept) Volume Preserving Flows
- Goal)
- Build flows whose Jacobian determinant is equal to 1.
- Allow rich posterior distributions
- Types
- Finite
- Infinitesimal
Model) Non-linear Independent Components Estimation (NICE)
- Methods)
- Partition the latent vector into \(\mathbf{z} = (\mathbf{z}_A, \mathbf{z}_B)\).
- e.g.)
- \(\mathbf{z} = (\mathbf{z}_{1:d}, \mathbf{z}_{d+1:D})\).
- Transformation
- \(f(\cdot)\). : neural network s.t.
- has easy to compute inverse \(g(\cdot)\).
- \(f(\mathbf{z}) = (\mathbf{z}_A, \mathbf{z}_B + h_\lambda(\mathbf{z}_A))\).
- \(g(\mathbf{z}') = (\mathbf{z}_A', \mathbf{z}_B' - h_\lambda(\mathbf{z}_A'))\).
- where
- \(h_\lambda\). is a neural network with parameters \(\lambda\).
- Alternation between \(\mathbf{z}_A\). and \(\mathbf{z}_B\).
- why?)
- To mix all components of the initial random variable \(\mathbf{z}_0\).
- Resulting Density
- \(\ln q_K(f_K\circ\cdots\circ f_1(\mathbf{z}_0)) = \ln q_0(\mathbf{z}_0)\).
- \(\ln q_K(\mathbf{z}') = \ln q_0(g_1\circ \cdots\circ g_K(\mathbf{z}'))\).
- Props.)
- The Jacobian has a zero upper triangular part and a unit diagonal, resulting in a determinant of 1.
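A minimal sketch of one NICE additive coupling layer and its inverse. The shift function below is a hypothetical stand-in for the network \(h_\lambda\) (one \(\tanh\) unit per output with hand-picked weights); only the coupling structure follows the notes. The inverse subtracts the same shift, which is why it is as cheap as the forward pass.

```python
import math

def h_lambda(z_a, weights):
    """Hypothetical stand-in for h_lambda: one tanh unit per component of z_B."""
    return [math.tanh(sum(w * a for w, a in zip(row, z_a))) for row in weights]

def nice_forward(z, d, weights):
    """f(z) = (z_A, z_B + h_lambda(z_A)) with z_A = z[:d], z_B = z[d:]."""
    z_a, z_b = z[:d], z[d:]
    shift = h_lambda(z_a, weights)
    return z_a + [b + s for b, s in zip(z_b, shift)]

def nice_inverse(z_out, d, weights):
    """g(z') = (z'_A, z'_B - h_lambda(z'_A)); z_A passes through unchanged."""
    z_a, z_b = z_out[:d], z_out[d:]
    shift = h_lambda(z_a, weights)
    return z_a + [b - s for b, s in zip(z_b, shift)]

z = [0.3, -1.1, 0.8, 2.0]
d = 2
weights = [[0.5, -0.2], [1.0, 0.3]]   # one row per component of z_B
z_out = nice_forward(z, d, weights)
z_back = nice_inverse(z_out, d, weights)
print(z_out, z_back)
```

Because \(\mathbf{z}_A\) is copied and \(\mathbf{z}_B\) is only shifted, the log-density is unchanged by the layer, so no determinant needs to be computed.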
Model) Hamiltonian Variational Approximation (HVI)