# Recap

## ML as a Bag of Tricks

- Fast special cases
  - K-means
  - Kernel Density Estimation
  - SVMs
  - Boosting
  - Random Forests

- Extensible family
  - Mixture of Gaussians
  - Latent variable models
  - Gaussian processes
  - Deep neural nets
  - Bayesian neural nets

## Regularization as a Bag of Tricks

- Fast special cases
  - Early stopping
  - Ensembling
  - L2 Regularization
  - Gradient noise
  - Dropout
  - Expectation-Maximization

- Extensible family
  - Stochastic variational inference

## A Language of Models

Hidden Markov Models, Mixture of Gaussians, Logistic Regression, VAEs, Normalizing flows. These are simply examples from a composable language of probabilistic models.

## AI as a Bag of Tricks

- Russell and Norvig's parts of AI
  - Machine Learning
  - Natural Language Processing
  - Knowledge representation
  - Automated reasoning
  - Computer Vision
  - Robotics

- Extensible family
  - Deep probabilistic latent-variable models + decision theory

## Losses are log-likelihoods

- Squared loss is just the (negative, unnormalized) Normal log-pdf (see the numeric check after this list)
- "Cross-entropy" as used in deep learning usually just means the negative Categorical log-pmf
- Actual definition: $H(p, q) = -\sum_{x}p(x)\log q(x)$

- Teacher forcing is just evaluating the likelihood of a sequential model: $p(x)=\prod_{i}p_{\theta}(x_i|x_{<i})$
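
As a quick numeric check (not from the notes; the numbers and the scipy dependency are illustrative), the snippet below verifies that the squared error differs from the negative Normal log-pdf only by a normalization constant, and that cross-entropy against a one-hot label reduces to the negative Categorical log-pmf.

```python
import numpy as np
from scipy.stats import norm

# Squared loss vs. Normal log-pdf (made-up numbers, unit variance assumed).
y, y_hat = 2.0, 1.3
sq_loss = 0.5 * (y - y_hat) ** 2
neg_log_pdf = -norm.logpdf(y, loc=y_hat, scale=1.0)
# They differ only by the constant 0.5 * log(2 * pi):
assert np.isclose(neg_log_pdf, sq_loss + 0.5 * np.log(2 * np.pi))

# Cross-entropy H(p, q) = -sum_x p(x) log q(x).
# With a one-hot "true" distribution p, it reduces to the negative
# Categorical log-pmf of the observed class under the model q.
q = np.array([0.1, 0.7, 0.2])   # model's class probabilities
p = np.array([0.0, 1.0, 0.0])   # one-hot label
cross_entropy = -np.sum(p * np.log(q))
assert np.isclose(cross_entropy, -np.log(q[1]))
```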

# Generative Models

## Definition

- Discriminative: Trained to answer a single query, $p(class|image)$
- Generative: Trained to model data distribution too: $p(class, image)$ or simply $p(image)$
- Any distribution can be conditioned and sampled from
- Can do ancestral sampling if $p(x, z) = p(z)p(x|z)$ (see the sketch after this list)
- Conditional probability is an extension of logic that tells us how to combine evidence automatically
- Generative models are composable. Useful for modeling and semi-supervised learning.
- Latent variables are sometimes interpretable.
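
As referenced above, here is a minimal numpy sketch of ancestral sampling from $p(x, z) = p(z)p(x|z)$ using a toy mixture of Gaussians; the mixture weights, means, and standard deviations are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from p(x, z) = p(z) p(x | z):
# first draw the latent z, then draw x conditioned on it.
mix_weights = np.array([0.3, 0.7])   # p(z), made up for illustration
means = np.array([-2.0, 3.0])        # mean of p(x | z)
stds = np.array([0.5, 1.0])          # std of p(x | z)

z = rng.choice(len(mix_weights), size=1000, p=mix_weights)   # z ~ p(z)
x = rng.normal(means[z], stds[z])                             # x ~ p(x | z)
```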

## Main Approaches

- Sequential Models: $p(x) = \prod_i p_\theta(x_i|x_{<i})$ (see the sketch below)
  - Pixel Recurrent Neural Networks
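
A minimal sketch of what the sequential factorization means in code (the conditional below, which depends only on the previous token, is a made-up toy model, not PixelRNN): sampling requires $O(N)$ sequential steps, while teacher-forced likelihood evaluation conditions each step on the observed prefix.

```python
import numpy as np

rng = np.random.default_rng(0)

def cond_prob_one(prefix):
    # Toy conditional p(x_i = 1 | x_{<i}): a logistic function of the last token.
    last = prefix[-1] if len(prefix) else 0
    return 1.0 / (1.0 + np.exp(-(2.0 * last - 1.0)))

def sample(n):
    # Ancestral / sequential sampling: O(n) dependent steps.
    x = []
    for _ in range(n):
        x.append(int(rng.random() < cond_prob_one(x)))
    return x

def log_likelihood(x):
    # Teacher forcing: condition every step on the *observed* prefix.
    ll = 0.0
    for i in range(len(x)):
        p1 = cond_prob_one(x[:i])
        ll += np.log(p1 if x[i] == 1 else 1.0 - p1)
    return ll

seq = sample(10)
print(seq, log_likelihood(seq))
```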

- Variational Autoencoders: $x = f_\theta(z)+\epsilon$
  - Variational Inference
  - Need to compute $p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{\int p_\theta(x|z')p(z')dz'}$
  - Optimize a distribution $q_\phi(z|x)$ to match $p_\theta(z|x)$
  - What if there is a latent variable $z_i$ per data point, plus global parameters $\theta$?
  - Optimize each $q_\phi(z_i|x_i)$ to match each $p_\theta(z_i|x_i)$, then update $\theta$. Slow!

- Variational Autoencoder
  - Train a recognition network to output approximately optimal variational distributions $q_\phi(z_i|x_i)$ given $x_i$
  - Total freedom in designing the recognition procedure
  - Can be evaluated by how well it matches $p_\theta(z_i|x_i)$

- Consequences of using a recognition network
  - Don't need to re-optimize $q_\phi(z|x)$ each time $\theta$ changes. Much faster!
  - Recognition network won't necessarily give the optimal $\phi_i$
  - Can have fast test-time inference
  - Can train the recognition network jointly with the generator (see the sketch after this list)
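
A minimal PyTorch sketch of the amortized setup described above, assuming a Bernoulli decoder, a standard Normal prior, the reparameterization trick, and a single-sample Monte Carlo ELBO; the layer sizes and stand-in data are illustrative, not the lecture's.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        # Recognition network q_phi(z|x): outputs mean and log-variance.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 2 * z_dim))
        # Decoder p_theta(x|z): outputs Bernoulli logits per pixel.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        log_px_z = -nn.functional.binary_cross_entropy_with_logits(
            self.dec(z), x, reduction="none").sum(-1)
        # KL(q_phi(z|x) || p(z)) for a standard Normal prior, in closed form.
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1)
        return (log_px_z - kl).mean()

# One joint update of encoder (phi) and decoder (theta) on stand-in data.
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784).round()   # stand-in batch of binary "images"
loss = -model.elbo(x)
loss.backward()
opt.step()
```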

- Variational Decoder
  - Often, $p(x|z) = \mathcal{N}(x|f_\theta(z), \mathrm{diag}(g_\theta(z)))$
  - Final step has an independence assumption: causes noisy samples and blurry means
  - $p(x|z)$ can be anything: RNN, PixelRNN, Real NVP, deconv net
  - Decoder often looks like the inverse of the encoder
  - Encoders can come from supervised learning
  - Real-valued Non-Volume-Preserving transformations (Real NVP)
    - Divides the variables into two halves; updates only one half with a scale and shift computed from the other half (see the coupling-layer sketch after this list)
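
A minimal PyTorch sketch of one Real NVP coupling layer under illustrative dimensions and networks: one half of the variables passes through unchanged, the other half is scaled and shifted by functions of the first half, so the transformation is invertible and the log-determinant of the Jacobian is just the sum of the log-scales.

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """One Real NVP coupling layer (illustrative sizes)."""
    def __init__(self, dim=4, h=64):
        super().__init__()
        self.d = dim // 2
        # Scale and shift are functions of the first half only.
        self.scale = nn.Sequential(nn.Linear(self.d, h), nn.Tanh(),
                                   nn.Linear(h, dim - self.d))
        self.shift = nn.Sequential(nn.Linear(self.d, h), nn.Tanh(),
                                   nn.Linear(h, dim - self.d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s = self.scale(x1)
        y2 = x2 * torch.exp(s) + self.shift(x1)   # only the second half changes
        log_det = s.sum(-1)                        # triangular Jacobian
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        x2 = (y2 - self.shift(y1)) * torch.exp(-self.scale(y1))
        return torch.cat([y1, x2], dim=-1)

layer = Coupling()
x = torch.randn(8, 4)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-5)   # exactly invertible
```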

- Normalizing Flows (explicitly normalized models): $x=f_\theta(z)$, $p(x)=p(z)|\det(\nabla f)|^{-1}$
  - Flows as Euler integrators
  - Middle layers look like $h_{t+1} = h_t + f(h_t, \theta_t)$
  - Limit of smaller steps: $\frac{dh(t)}{dt} = f(h(t), \theta(t))$ (see the sketch after this list)
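
A small numpy sketch of the Euler-integrator view; the dynamics $f(h) = -h$ and the explicit step size are illustrative assumptions (the notes absorb the step size into $f$). Stacking more residual steps over a fixed time interval approaches the exact ODE solution.

```python
import numpy as np

def f(h):
    # Toy dynamics with a known ODE solution: dh/dt = -h  =>  h(T) = h(0) * exp(-T).
    return -h

def resnet_forward(h0, num_layers, T=1.0):
    # Each "layer" is one Euler step: h_{t+1} = h_t + dt * f(h_t).
    h, dt = h0, T / num_layers
    for _ in range(num_layers):
        h = h + dt * f(h)
    return h

h0, T = 1.0, 1.0
exact = h0 * np.exp(-T)
for n in [2, 10, 100, 1000]:
    print(n, abs(resnet_forward(h0, n) - exact))   # error shrinks with depth
```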

- Implicit models (GANs): $x=f_\theta(z)$
  - $x=G(z;\theta^{(G)})$
  - Must be differentiable
  - No invertibility requirement
  - Trainable for any size of $z$
  - Some guarantees require $z$ to have higher dimension than $x$
  - Can make $x$ conditionally Gaussian given $z$, but need not do so

- Discriminator Strategy
  - Optimal $D(x)$ for any $p_{data}(x)$ and $p_{model}(x)$ is always $D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$
  - Estimating this ratio using supervised learning is the key approximation mechanism used by GANs
  - $J^{(D)} = -\frac{1}{2}\mathbb{E}_{x\sim p_{data}}\log D(x) - \frac{1}{2}\mathbb{E}_{z}\log(1-D(G(z)))$
  - $J^{(G)} = -J^{(D)}$
  - Equilibrium is a saddle point of the discriminator loss
  - Resembles the JS-divergence
  - Generator minimizes the log-probability of the discriminator being correct (see the training-step sketch after this list)
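
A minimal PyTorch sketch of one minimax training step using the losses above; the 1-D stand-in data, tiny networks, and optimizer settings are illustrative assumptions, not the lecture's.

```python
import torch
import torch.nn as nn

z_dim, x_dim = 2, 1
G = nn.Sequential(nn.Linear(z_dim, 16), nn.ReLU(), nn.Linear(16, x_dim))
D = nn.Sequential(nn.Linear(x_dim, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)

x_data = torch.randn(64, x_dim) * 0.5 + 2.0   # stand-in "real" data
z = torch.randn(64, z_dim)

# Discriminator step: J^(D) = -1/2 E_x[log D(x)] - 1/2 E_z[log(1 - D(G(z)))].
d_loss = -0.5 * torch.log(D(x_data)).mean() \
         - 0.5 * torch.log(1 - D(G(z).detach())).mean()
opt_D.zero_grad(); d_loss.backward(); opt_D.step()

# Generator step: J^(G) = -J^(D); only the fake-sample term depends on G,
# so the generator minimizes E_z[log(1 - D(G(z)))].
g_loss = 0.5 * torch.log(1 - D(G(z))).mean()
opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```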

- Relation to VAEs
  - Same graphical model: $z\rightarrow x$
  - VAEs have an explicit likelihood: $p(x|z)$
  - GANs have no explicit likelihood; they are likelihood-free models


## Comparison

- Sequential Models: $p(x) = \prod_i p_\theta(x_i|x_{<i})$
  - Pros: Exact likelihoods, easy to train
  - Cons: $O(N)$ layers to evaluate or sample, need to choose an ordering

- Variational Autoencoder: $x = f_\theta(z)+\epsilon$
  - Pros: Cheap to evaluate and sample, low-dimensional latents
  - Cons: Factorized likelihood gives noisy samples

- Explicitly normalized models: $x=f_\theta(z)$, $p(x)=p(z)|\det(\nabla f)|^{-1}$
  - Pros: Exact likelihoods, easy to train
  - Cons: Must cripple layers to maintain tractability, need huge models

- Implicit models: $x=f_\theta(z)$
  - Pros: Cheap to sample, no factorization
  - Cons: Hard to train, likelihood not available