# Recap

## ML as a Bag of Tricks

• Fast special cases
  • K-means
  • Kernel Density Estimation
  • SVMs
  • Boosting
  • Random Forests
• Extensible family
  • Mixture of Gaussians
  • Latent variable models
  • Gaussian processes
  • Deep neural nets
  • Bayesian neural nets

## Regularization as a Bag of Tricks

• Fast special cases
  • Early stopping
  • Ensembling
  • L2 Regularization
  • Dropout
  • Expectation-Maximization
• Extensible family
  • Stochastic variational inference

## A Language of Models

Hidden Markov Models, Mixture of Gaussians, Logistic Regression, VAEs, Normalizing flows. These are simply examples from a composable language of probabilistic models.

## AI as a Bag of Tricks

• Russell and Norvig's parts of AI
  • Machine Learning
  • Natural Language Processing
  • Knowledge representation
  • Automated reasoning
  • Computer Vision
  • Robotics
• Extensible family
  • Deep probabilistic latent-variable models + decision theory

## Losses are log-likelihoods

• Squared loss is just the unnormalized Normal log-pdf (up to sign and an additive constant)
• "Cross-entropy" in ML usually just means the Categorical log-pmf
  • Actual definition: $H(p, q) = -\sum_{x}p(x)\log q(x)$
• Teacher forcing is just evaluating the likelihood of a sequential model: $p(x)=\prod_{i}p_{\theta}(x_i|x_{<i})$
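
A quick numpy sanity check of these correspondences; the specific numbers and probabilities are arbitrary illustrative values, not anything from the course:

```python
import numpy as np
from scipy.stats import norm

# Squared loss vs. Normal log-pdf with fixed variance: they differ only by
# sign and an additive constant, so they have the same minimizer in y_hat.
y, y_hat = 2.0, 1.3
sq_loss = 0.5 * (y - y_hat) ** 2
neg_log_pdf = -norm.logpdf(y, loc=y_hat, scale=1.0)
print(np.isclose(neg_log_pdf, sq_loss + 0.5 * np.log(2 * np.pi)))  # True

# Cross-entropy with a one-hot label is just the negative Categorical
# log-pmf of the observed class.
probs = np.array([0.1, 0.7, 0.2])
label = 1
cross_entropy = -np.log(probs[label])

# Teacher forcing: the sequence log-likelihood is the sum of per-token
# conditional log-probabilities p(x_i | x_<i), each evaluated with the
# true previous tokens fed in.
token_probs = np.array([0.5, 0.8, 0.9])   # p(x_i | x_<i) for each position
sequence_log_lik = np.sum(np.log(token_probs))
print(cross_entropy, sequence_log_lik)
```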

# Generative Models

## Definition

• Discriminative: Trained to answer a single query, $p(class|image)$
• Generative: Trained to model data distribution too: $p(class, image)$ or simply $p(image)$
• Any distribution can be conditioned and sampled from
• Can do ancestral sampling if $p(x, z) = p(z)p(x|z)$
• Conditional probability is an extension of logic that tells us how to combine evidence automatically
• Generative models are composable. Useful for modeling and semi-supervised learning.
• Latent variables are sometimes interpretable.
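
For example, ancestral sampling from $p(x, z) = p(z)p(x|z)$ just means sampling the variables in topological order. A minimal sketch, with a made-up two-component model chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_joint():
    # Ancestral sampling from p(x, z) = p(z) p(x|z):
    # draw the parent z first, then draw x conditioned on it.
    z = rng.choice([0, 1], p=[0.3, 0.7])            # z ~ p(z)
    x = rng.normal(loc=(-2.0, 2.0)[z], scale=1.0)   # x | z ~ Normal(mu_z, 1)
    return z, x

print([sample_joint() for _ in range(5)])
```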

## Main Approaches

• Sequential Models: $p(x) = \prod_i p_\theta(x_i|x_{<i})$
  • Pixel Recurrent Neural Networks
• Variational Autoencoders: $x = f_\theta(z)+\epsilon$ (a minimal ELBO sketch appears at the end of these notes)
  • Variational Inference
    • Need to compute $p_\theta(z|x) = \frac{p_\theta(x|z)p(z)}{\int p_\theta(x|z')p(z')\,dz'}$
    • Optimize a distribution $q_\phi(z|x)$ to match $p_\theta(z|x)$
    • What if there is a latent variable $z_i$ per data point, plus global parameters $\theta$?
    • Optimize each $q_\phi(z_i|x_i)$ to match each $p_\theta(z_i|x_i)$, then update $\theta$. Slow!
  • Variational Autoencoder
    • Train a recognition network to output approximately optimal variational distributions $q_\phi(z_i|x_i)$ given $x_i$
    • Total freedom in designing the recognition procedure
    • Can be evaluated by how well it matches $p_\theta(z_i|x_i)$
  • Consequences of using a recognition network
    • Don't need to re-optimize $q(z|x)$ each time $\theta$ changes. Much faster!
    • Recognition network won't necessarily give optimal $\phi_i$
    • Can have fast test-time inference
    • Can train the recognition network jointly with the generator
  • Variational Decoder
    • Often, $p(x|z) = \mathcal{N}(x \mid f_\theta(z), \mathrm{diag}(g_\theta(z)))$
    • Final step has an independence assumption, which causes noisy samples and blurry means
    • $p(x|z)$ can be anything: RNN, pixel RNN, Real NVP, deconv net
    • Decoder often looks like the inverse of the encoder
    • Encoders can come from supervised learning
• Real-Valued Non-Volume-Preserving Transformations (Real NVP)
  • Divides the variables into two parts, updates only one half with a scale and shift (see the coupling-layer sketch at the end of these notes)
• Normalized models: $x=f_\theta(z)$, $p(x)=p(z)\,|\det(\nabla f)|^{-1}$
• Flows as Euler integrators
  • Middle layers look like $h_{t+1} = h_t + f(h_t, \theta_t)$
  • Limit of smaller steps: $\frac{dh(t)}{dt} = f(h(t), \theta(t))$
• Normalizing Flows
• Implicit models (GANs): $x=f_\theta(z)$ (see the discriminator-loss sketch at the end of these notes)
  • $x=G(z;\theta^{(G)})$
  • Must be differentiable
  • No invertibility requirement
  • Trainable for any size of $z$
  • Some guarantees require $z$ to have higher dimension than $x$
  • Can make $x$ conditionally Gaussian given $z$, but need not do so
  • Discriminator Strategy
    • Optimal $D(x)$ for any $p_{data}(x)$ and $p_{model}(x)$ is always
      • $D(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{model}(x)}$
    • Estimating this ratio using supervised learning is the key approximation mechanism used by GANs
    • $J^{(D)} = -\frac{1}{2}E_{x\sim p_{data}}\log D(x) - \frac{1}{2}E_{z}\log(1-D(G(z)))$
    • $J^{(G)} = -J^{(D)}$
    • Equilibrium is a saddle point of the discriminator loss
    • Resembles JS-divergence
    • Generator minimizes the log-probability of the discriminator being correct
  • Relation to VAE
    • Same graphical model: $z\rightarrow x$
    • VAEs have an explicit likelihood: $p(x|z)$
    • GANs have no explicit likelihood
    • Likelihood-free models

## Comparison

• Sequential Models: $p(x) = \prod_i p_\theta(x_i|x_{<i})$
  • Pros: Exact likelihoods, easy to train
  • Cons: O(N) layers to evaluate or sample, need to choose an order
• Variational Autoencoder: $x = f_\theta(z)+\epsilon$
  • Pros: Cheap to evaluate and sample, low-D latents
  • Cons: Factorized likelihood gives noisy samples
• Explicitly normalized models: $x=f_\theta(z)$, $p(x)=p(z)\,|\det(\nabla f)|^{-1}$
  • Pros: Exact likelihoods, easy to train
  • Cons: Must cripple layers to maintain tractability, need huge models
• Implicit models: $x=f_\theta(z)$
  • Pros: Cheap to sample, no factorization
  • Cons: Hard to train, likelihood not available
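
To make the VAE objective concrete, here is a minimal numpy sketch of a single-sample Monte Carlo ELBO estimate using the reparameterization trick. The `recognition_network` and `decoder_mean` functions are toy stand-ins I made up for illustration; in a real VAE they would be neural nets with parameters $\phi$ and $\theta$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def recognition_network(x):
    # q_phi(z|x) = Normal(mu, sigma^2); a toy closed-form stand-in for the
    # encoder network that would normally output (mu, sigma) from x.
    return 0.5 * x, 0.8

def decoder_mean(z):
    # p_theta(x|z) = Normal(f_theta(z), 1); a toy stand-in for f_theta.
    return 2.0 * z

def elbo_estimate(x):
    # Single-sample Monte Carlo estimate of the ELBO, a lower bound on log p(x).
    mu, sigma = recognition_network(x)
    eps = rng.standard_normal()
    z = mu + sigma * eps                             # reparameterization trick
    log_p_x_given_z = norm.logpdf(x, loc=decoder_mean(z), scale=1.0)
    log_p_z = norm.logpdf(z, loc=0.0, scale=1.0)     # prior p(z)
    log_q_z = norm.logpdf(z, loc=mu, scale=sigma)    # q_phi(z|x)
    return log_p_x_given_z + log_p_z - log_q_z

# Averaging many single-sample estimates reduces the Monte Carlo noise.
print(np.mean([elbo_estimate(1.5) for _ in range(1000)]))
```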
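The Real NVP coupling layer mentioned above (split the variables in two, transform only one half with a scale and shift computed from the other half) can be sketched in a few lines. The `scale_net` and `shift_net` functions are toy stand-ins for the neural networks that would normally produce the scale and shift:

```python
import numpy as np

def scale_net(x1):
    return 0.1 * x1                 # toy log-scale s(x1)

def shift_net(x1):
    return 0.5 * x1 + 1.0           # toy shift t(x1)

def coupling_forward(x):
    # Only x2 is transformed, using a scale and shift that depend on x1,
    # so the Jacobian is triangular and its log-determinant is just the
    # sum of the log-scales.
    x1, x2 = np.split(x, 2)
    s, t = scale_net(x1), shift_net(x1)
    y2 = x2 * np.exp(s) + t
    log_det = np.sum(s)             # log |det(dy/dx)|
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    # Exact inverse: recompute s and t from the untouched half.
    y1, y2 = np.split(y, 2)
    s, t = scale_net(y1), shift_net(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

x = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = coupling_forward(x)
print(np.allclose(coupling_inverse(y), x), log_det)
```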
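Finally, a small numerical illustration of the GAN discriminator strategy: with toy 1-D densities standing in for $p_{data}$ and $p_{model}$ (both chosen arbitrarily for this example), the optimal discriminator is the density ratio given above, and $J^{(D)}$, $J^{(G)}$ can be estimated by Monte Carlo:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D densities standing in for p_data(x) and p_model(x).
def p_data(x):
    return norm.pdf(x, loc=0.0, scale=1.0)

def p_model(x):
    return norm.pdf(x, loc=1.0, scale=1.0)

def optimal_D(x):
    # Optimal discriminator: the density ratio p_data / (p_data + p_model).
    return p_data(x) / (p_data(x) + p_model(x))

# Monte Carlo estimate of
#   J^(D) = -1/2 E_{x~p_data}[log D(x)] - 1/2 E_z[log(1 - D(G(z)))],
# where the "generator samples" are drawn directly from p_model for simplicity.
x_real = rng.normal(0.0, 1.0, size=10_000)
x_fake = rng.normal(1.0, 1.0, size=10_000)
J_D = (-0.5 * np.mean(np.log(optimal_D(x_real)))
       - 0.5 * np.mean(np.log(1.0 - optimal_D(x_fake))))
J_G = -J_D   # zero-sum game: the equilibrium is a saddle point of J^(D)
print(J_D, J_G)
```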