(Presentation PDF) A Style-Based Generator Architecture for Generative Adversarial Networks (Style GAN)
@karrasStyleBasedGeneratorArchitecture2019
History
- GAN
- Latent Space and Latent Variable
- DCGAN (2015)
- Introduced
- Convolution (CNN)
- Latent space
- Introduced
- InfoGAN (2016)
- Disentangled the latent factor
- $z = (z_{\text{noise}}, z_{\text{latent}})$.
- where
- $z_{\text{noise}}$ : incompressible noise
- Unstructured randomness that provides sample diversity
- $z_{\text{latent}}$ : interpretable latent code
- Semantically meaningful factors of variation we want to learn
- e.g.) From Info GAN
- $z_{\text{noise}}$ : incompressible noise
- where
- $z = (z_{\text{noise}}, z_{\text{latent}})$.
- Disentangled the latent factor
- DCGAN (2015)
- High-Resolution Image Synthesis
- Progressive GAN (2018)
- Progressive Growing :
- Stabilized generating high-resolution 1024×1024 photo realistic image
- $(8\times8)\rightarrow(16\times16)\rightarrow\cdots\rightarrow(1024\times1024)$.
- Stabilized generating high-resolution 1024×1024 photo realistic image
- Introduced FID as a GAN performance score
- From ProGAN to Style-GAN
- Progressive Growing :
- Progressive GAN (2018)
- Latent Space and Latent Variable
- Style Transfer
- AdaIN (2017)
2. Style-based generator
- Model)
- $\mathbf{z} \in \mathcal{Z}$ : a latent code
- $\dim(\mathcal{Z}) = 512$.
- $f$ : Mapping network
- $f(\mathbf{z}) = \mathbf{w}$ :
- where
- $f:\mathcal{Z}\rightarrow\mathcal{W}$ : a non-linear mapping network
- $f$ is implemented using 8-layer MLP
- $\dim(\mathcal{W}) = 512$.
- The truncation trick is performed at $\mathcal{W}$, not $\mathcal{Z}$
- where
- $A_{\ell}(\mathbf{w}) = { (\mathbf{y}{s,\ell, i}, \mathbf{y}{b,\ell, i}) }{i=1}^{C\ell}$ : a learned affine transformation inputted into the $\ell$-th AdaIN layer
- where
- $A_\ell : \mathcal{W} \rightarrow \mathcal{Y}_\ell$.
- $(\mathbf{y}{s,\ell, i}, \mathbf{y}{b,\ell, i})$ : the styles
- where
- $\mathbf{y}_{s,\ell, i}$ : $\ell$-th layer and $i$-th channel’s scale parameter
- $\mathbf{y}_{b,\ell, i}$ : $\ell$-th layer and $i$-th channel’s bias parameter
- where
- where
- $f(\mathbf{z}) = \mathbf{w}$ :
- $g$ : Synthesis network
- $\text{AdaIN}(\mathbf{x}i, \mathbf{y}) = \displaystyle\mathbf{y}{s,i}\,\frac{\mathbf{x}i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)} + \mathbf{y}{b,i}$ : Adaptive Instance Normalization
- where
- $\mathbf{x}_i$ : the $i$-th channel (feature map) of the activation $\mathbf{x}$
- Desc.)
- AdaIN is a concept from style transfer.
- The normalization $\left(\displaystyle\frac{\mathbf{x}_i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)}\right)$ removes the original style, leaving only the content.
- The scale $\mathbf{y}{s,i}$ and bias $\mathbf{y}{b,i}$ reintroduce a new style.
- where
- $B_\ell$ : the noise input to the $\ell$-th AdaIN layer
- Desc.)
- Single-channel images consisting of uncorrelated Gaussian noise
- The noise image is broadcasted to all feature maps using learned per feature map (channel) scaling factors and then added to the output of the corresponding convolution
- Desc.)
- $\text{Const } 4\times4\times512$ : a learned constant tensor
- Desc.)
- In traditional GAN models, the latent code is the first input to the first convolution layer.
- However, the authors figured out that when AdaIN operation is added, the network no longer benefitted from feeding the latent code into the first convolution layer.
- Thus, the substituted the latent code with the $\text{Const } 4\times4\times512$.
- This constant tensor is also optimized during training via backpropagation.
- Desc.)
- $\text{AdaIN}(\mathbf{x}i, \mathbf{y}) = \displaystyle\mathbf{y}{s,i}\,\frac{\mathbf{x}i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)} + \mathbf{y}{b,i}$ : Adaptive Instance Normalization
- $\mathbf{z} \in \mathcal{Z}$ : a latent code
3. Properties of the style-based generator
Concept) Style Localization
- Desc.)
- Style-GAN’s generator controls the image synthesis via scale-specific modifications to the styles.
- How?)
- The mapping network $f$ and the affine transformation $A$ draw samples for each style
- The synthesis network $g$ generates a novel image based on a collection of styles.
- How?)
- The effect of each style is localized in the network
- i.e.) Each style affects only one convolutional layer before being overridden by the next AdaIN
- i.e.) the specific aspect of the image
- Why?)
- Recall the AdaIN operation of $\text{AdaIN}(\mathbf{x}i, \mathbf{y}) = \displaystyle\mathbf{y}{s,i}\,\frac{\mathbf{x}i - \mu(\mathbf{x}_i)}{\sigma(\mathbf{x}_i)} + \mathbf{y}{b,i}$.
- It first normalizes each channel to zero mean and unit variance, and then applies scales and bias based on style $(\mathbf{y}{s,i}, \mathbf{y}{b,i})$.
- Thus, each channel has new statistics dictated by the style $\mathbf{y}$.
- cf.) These new statistics are independent of the original statistics of $\mathbf{x}_i$
- Why?) Normalization!
- cf.) These new statistics are independent of the original statistics of $\mathbf{x}_i$
- Then, they modify the relative importance of features for the subsequent convolution operation.
- i.e.) Each style affects only one convolutional layer before being overridden by the next AdaIN
- Style-GAN’s generator controls the image synthesis via scale-specific modifications to the styles.
3.1 Style Mixing
- Goal)
- Regularization on styles
- Encourage style localizations
- Prevent the network from assuming that adjacent styles are correlated
- Regularization on styles
- How?)
- Utilize two latent codes $\mathbf{z_1,z_2}$.
- Set a crossover point in the synthesis network $g$.
- For AdaINs
- before the crossover point, use $A(\mathbf{w}_1) = A(f(\mathbf{z}_1))$
- after the crossover point, use $A(\mathbf{w}_2) = A(f(\mathbf{z}_2))$
- Result)
- Improved FID scores.
- Experiment)
-
Depending on the position of the crossover point we can transfer styles corresponding to
Level Portrait Example Coarse style ($4^2 - 8^2$) high-level aspects such as pose, general hair style, face shape, eyeglasses Middle style ($16^2 - 32^2$) smaller scale facial features, hair style, eye open/closed Fine style ($64^2 - 1024^2$) color scheme and microstructure
-
3.2 Stochastic Variation
Concept) Noise (Stochasticity)
- Desc.)
-
Analysis) Noise vs Style
Noise Style Effect only inconsequential stochastic variations global effects e.g. placement of hairs, stubble, freckles, etc changing pose, identity, etc. Why? the noise is added independently to each pixel Complete feature maps are scaled and biased with the same values - e.g.)
- Video Demonstration (Noise)
- Controlling the strength of the noise
- $B$ injected in the high resolution level $\rightarrow$ Coarse noise
- $B$ injected in the lowe resolution level $\rightarrow$ Fine noise
- Controlling the strength of the noise
- Video Demonstration (Noise)
4. Disentanglement studies
- e.g.)
- Desc.)
- Why features being curved in (b) matters?
- If so, the generator learns warped mapping to avoid missing area in the training example.
- Then the features get entangled, and we may not be able to control styles.
- By mapping $f:\mathcal{Z}\rightarrow\mathcal{W}$, we undo the warping.
- This disentangles the latent factors.
- Style GAN can control individual styles independently.
- Why features being curved in (b) matters?
- Desc.)
4.1 Perceptual path length
Enjoy Reading This Article?
Here are some more articles you might like to read next: