Recap

ML as a Bag of Tricks

  • Fast special cases
    • K-means
    • Kernel Density Estimation
    • SVMs
    • Boosting
    • Random Forests
  • Extensible family
    • Mixture of Gaussians
    • Latent variable models
    • Gaussian processes
    • Deep neural nets
    • Bayesian neural nets

Regularization as a Bag of Tricks

  • Fast special cases
    • Early stopping
    • Ensembling
    • L2 Regularization
    • Gradient noise
    • Dropout
    • Expectation-Maximization
  • Extensible family
    • Stochastic variational inference

A Language of Models

Hidden Markov Models, Mixture of Gaussians, Logistic Regression, VAEs, Normalizing flows. These are simply examples from a composable language of probabilistic models.

AI as a Bag of Tricks

  • Russell and Norvig's parts of AI
    • Machine Learning
    • Natural language processing
    • Knowledge representation
    • Automated reasoning
    • Computer Vision
    • Robotics
  • Extensible family
    • Deep probabilistic latent-variable models + decision theory

Losses are log-likelihoods

  • Squared loss is just the negative Normal log-pdf, up to an additive constant
  • "Cross-entropy" in ML now usually means the negative Categorical log-pmf
    • Actual definition: $H(p, q) = -\sum_{x}p(x)\log q(x)$, which reduces to $-\log q(x_{\text{true}})$ when $p$ is one-hot
  • Teacher forcing is just evaluating the likelihood of a sequential model: $p(x)=\prod_{i}p_{\theta}(x_i|x_{<i})$
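
A quick numerical check of the claims above; a minimal sketch using NumPy/SciPy, where the particular values and the one-hot target are only illustrative:

```python
import numpy as np
from scipy.stats import norm

# Squared loss vs. Normal log-pdf (unit variance):
# -log N(y | mu, 1) = 0.5 * (y - mu)^2 + 0.5 * log(2*pi)
y, mu = 1.3, 0.7
sq_loss = 0.5 * (y - mu) ** 2
assert np.isclose(-norm.logpdf(y, loc=mu, scale=1.0),
                  sq_loss + 0.5 * np.log(2 * np.pi))

# Cross-entropy with a one-hot target is the negative Categorical log-pmf:
# H(p, q) = -sum_x p(x) log q(x) = -log q(x_true) when p is one-hot.
q = np.array([0.1, 0.7, 0.2])   # predicted class probabilities
p = np.array([0.0, 1.0, 0.0])   # one-hot "true" distribution
assert np.isclose(-np.sum(p * np.log(q)), -np.log(q[1]))

# Teacher forcing: the sequence log-likelihood factorizes as
# log p(x) = sum_i log p_theta(x_i | x_<i), where each factor is
# scored with the observed prefix x_<i plugged in.
```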

Generative Models

Definition

  • Discriminative: Trained to answer a single query, $p(class|image)$
  • Generative: Trained to model data distribution too: $p(class, image)$ or simply $p(image)$
  • Any distribution can be conditioned and sampled from
  • Can do ancestral sampling if $p(x, z) = p(z)p(x|z)$ (see the sketch after this list)
  • Conditional probability is an extension of logic that tells us how to combine evidence automatically
  • Generative models are composable. Useful for modeling and semi-supervised learning.
  • Latent variables sometimes interpretable.
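
A minimal ancestral-sampling sketch for a joint $p(x, z) = p(z)p(x|z)$; the Gaussian prior and conditional below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint(n):
    """Ancestral sampling: draw z from the prior p(z),
    then x from the conditional p(x | z)."""
    z = rng.standard_normal(n)                  # z ~ p(z) = N(0, 1)
    x = 2.0 * z + 0.5 * rng.standard_normal(n)  # x | z ~ N(2z, 0.5^2)
    return x, z

x, z = sample_joint(1000)
# The same joint can also be conditioned the other way, e.g. p(z | x):
# here it is Gaussian and available in closed form, but in general it
# requires approximate inference (as in the VAE material below).
```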

Main Approaches

  • Sequential Models: $p(x) = \prod_ip_\theta(x_i|x_{<i})$
    • Pixel Recurrent Neural Networks
  • Variational Autoencoders: $x = f_\theta(z)+\epsilon$
    • Variational Inference
      • Need to compute $p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{\int p_\theta(x|z')p(z')dz'}$
      • Optimize a distribution $q_\phi(z|x)$ to match $p_\theta(z|x)$
      • What if there is a latent variable $z_i$ per data point, plus global parameters $\theta$?
      • Naive approach: optimize a separate $q_{\phi_i}(z_i|x_i)$ to match each $p_\theta(z_i|x_i)$, then update $\theta$. Slow!
    • Variational Autoencoder
      • Train a recognition network to output approximately optimal variational distributions $q_\phi(z_i|x_i)$ given $x_i$ (a minimal amortized VAE sketch appears after this list)
      • Total freedom in designing recognition procedure
      • Can be evaluated by how well it matches $p_\theta(z_i|x_i)$
    • Consequences of using a recognition network
      • Don't need to re-optimize $q(z|x)$ each time $\theta$ changes. Much faster!
      • Recognition network won't necessarily give the optimal $\phi_i$
      • Can have fast test-time inference
      • Can train recognition network jointly with generator
    • The Decoder $p_\theta(x|z)$
      • Often, $p(x|z) = \mathcal{N}(x|f_\theta(z), \mathrm{diag}(g_\theta(z)))$
      • The final independence assumption causes noisy samples and blurry means
      • $p(x|z)$ can be anything: RNN, pixel RNN, real NVP, deconv net
      • Decoder often looks like inverse of encoder
      • Encoders can come from supervised learning
      • Real-Valued Non-Volume-Preserving Transformations (Real NVP)
        • Splits the variables into two halves and updates only one half with a scale and shift computed from the other (see the coupling-layer sketch after this list)
  • Explicitly normalized models (normalizing flows): $x=f_\theta(z)$, $p(x)=p(z)\,|\det(\nabla f)|^{-1}$
    • Flows as Euler integrators
      • Middle layers look like $h_{t+1} = h_t + f(h_t, \theta_t)$
      • In the limit of smaller steps: $\frac{dh(t)}{dt} = f(h(t), \theta(t))$ (a small Euler-step sketch appears after this list)
      • [Figures: flow vector fields; normalizing-flow density transformations]
  • Implicit models (GANs): $x=f_\theta(z)$
    • $x=G(z;\theta^{(G)})$
      • Must be differentiable
      • No invertibility requirement
      • Trainable for any size of $z$
      • Some guarantees require $z$ to have higher dimension than $x$
      • Can make $x$ conditionally Gaussian given $z$ but need not do so
    • Discriminator Strategy
      • Optimal $D(x)$ for any $p_{data}(x)$ and $p_{model}(x)$ is always $D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$
      • Estimating this ratio using supervised learning is the key approximation mechanism used by GANs
      • $J^{(D)} = -\frac{1}{2}E_{x\sim p_{data}}[\log D(x)] - \frac{1}{2}E_{z}[\log(1-D(G(z)))]$
      • $J^{(G)} = -J^{(D)}$ (both losses are written out in the sketch after this list)
        • Equilibrium is a saddle point of the discriminator loss
        • Resembles JS-divergence
        • Generator minimizes the log-probability of the discriminator being correct
    • Relation to VAE
      • Same graphical model: $z\rightarrow x$
      • VAEs have an explicit likelihood: $p(x|z)$
      • GANs have no explicit likelihood
        • likelihood-free models
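
A minimal amortized-inference (VAE) sketch in PyTorch, under simplifying assumptions: a diagonal-Gaussian recognition network $q_\phi(z|x)$, a unit-variance factorized Gaussian decoder, and made-up layer sizes (the class name `TinyVAE` and all dimensions are illustrative). It shows the reparameterized ELBO that trains the recognition network and decoder jointly:

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """VAE sketch: a recognition network outputs q_phi(z|x) as a diagonal
    Gaussian; the decoder outputs the mean of a factorized Gaussian p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=8, h=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(),
                                 nn.Linear(h, 2 * z_dim))   # -> (mu, log_var)
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))       # -> mean of x

    def elbo(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I), so
        # gradients flow back through the recognition network.
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        x_mean = self.dec(z)
        # log p(x|z) under a unit-variance factorized Gaussian (up to constants):
        log_px_z = -0.5 * ((x - x_mean) ** 2).sum(-1)
        # KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians:
        kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(-1)
        return (log_px_z - kl).mean()

vae = TinyVAE()
x = torch.rand(32, 784)      # a stand-in batch
(-vae.elbo(x)).backward()    # minimize the negative ELBO; no per-x re-optimization of q
```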
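
A sketch of one Real NVP affine coupling layer (hypothetical sizes; assumes an even number of dimensions): half of the variables passes through unchanged, the other half is scaled and shifted as a function of the first half, so the layer is exactly invertible and its log-determinant is just a sum of log-scales:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Real NVP-style coupling: transform only the second half of the
    variables with a scale and shift computed from the first half."""
    def __init__(self, dim, h=64):
        super().__init__()
        half = dim // 2    # assumes dim is even
        self.net = nn.Sequential(nn.Linear(half, h), nn.ReLU(),
                                 nn.Linear(h, 2 * half))    # -> (log_s, t)

    def forward(self, z):
        # x = f_theta(z); the Jacobian is triangular, so
        # log|det(df/dz)| = sum of the log-scales.
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.net(z1).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t
        return torch.cat([z1, x2], dim=-1), log_s.sum(-1)

    def inverse(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)   # exactly invertible
        return torch.cat([x1, z2], dim=-1)

layer = AffineCoupling(dim=4)
z = torch.randn(5, 4)
x, log_det = layer(z)
# p(x) = p(z) |det(df/dz)|^{-1}, i.e. log p(x) = log p(z) - log_det;
# stacking couplings (with the halves swapped between layers) adds the log-dets.
z_rec = layer.inverse(x)
```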
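
A tiny sketch of the residual-step / Euler-integrator analogy; the transformation `f` and the $\theta$ schedule below are made up for illustration:

```python
import numpy as np

def f(h, theta):
    """A hypothetical per-layer transformation."""
    return np.tanh(theta * h)

# Discrete residual/flow layers: h_{t+1} = h_t + f(h_t, theta_t)
h = np.array([0.1, -0.3])
for theta_t in [0.5, 0.8, 1.1]:
    h = h + f(h, theta_t)

# The same update with a small step size dt is Euler integration of
# dh/dt = f(h(t), theta(t)); shrinking dt approaches the continuous limit.
h_cont, dt = np.array([0.1, -0.3]), 0.01
for t in np.arange(0.0, 3.0, dt):
    h_cont = h_cont + dt * f(h_cont, 0.5 + 0.2 * t)   # theta(t) schedule
```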
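
A sketch of the GAN objectives on a toy 2-D problem, with hypothetical MLPs standing in for $G$ and $D$; it writes out $J^{(D)}$ and the $G$-dependent part of $J^{(G)} = -J^{(D)}$:

```python
import torch
import torch.nn as nn

# Hypothetical toy generator and discriminator.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x_data = torch.randn(32, 2)       # stand-in for samples from p_data
x_fake = G(torch.randn(32, 16))   # x = G(z; theta_G), z from the prior
eps = 1e-7                        # avoid log(0)

# J^(D) = -1/2 E_{x~p_data}[log D(x)] - 1/2 E_z[log(1 - D(G(z)))]
J_D = (-0.5 * torch.log(D(x_data) + eps).mean()
       - 0.5 * torch.log(1 - D(x_fake.detach()) + eps).mean())

# J^(G) = -J^(D); only the term involving G(z) depends on the generator,
# so G minimizes the log-probability of D being correct on fakes.
J_G = 0.5 * torch.log(1 - D(x_fake) + eps).mean()

# At the optimum, D(x) = p_data(x) / (p_data(x) + p_model(x)); estimating
# this ratio by supervised classification is the key GAN approximation.
# (In practice the non-saturating loss -E_z[log D(G(z))] is often used.)
```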

Comparison

  • Sequential Models: $p(x) = \prod_ip_\theta(x_i|x_{<i})$
    • Pros: Exact likelihoods, easy to train
    • Cons: O(N) layers to evaluate or sample, need to choose order
  • Variational Autoencoder: $x = f_\theta(z)+\epsilon$
    • Pros: Cheap to evaluate and sample, low-D latents
    • Cons: Factorized likelihood gives noisy samples
  • Explicitly normalized models: $x=f_\theta(z)$, $p(x)=p(z)\,|\det(\nabla f)|^{-1}$
    • Pros: Exact likelihoods, easy to train
    • Cons: Must cripple layers to maintain tractability, need huge models
  • Implicit models: $x=f_\theta(z)$
    • Pros: Cheap to sample, no factorization
    • Cons: Hard to train, likelihood not available