# Recap

## Maximum Likelihood

As we acquire more data, we can safely consider more complex hypotheses. The approach we have considered so far for finding parameters is maximum likelihood: the probability of our data given the model is maximized with respect to the parameters.

$$\mathop{\arg\max}_{\theta} p(D | \theta, m)$$

This can be done in the same way as for least squares in linear regression -- take the gradient with respect to $\theta$ and set it equal to zero.
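As a sketch of this recipe (the variable names and data are our own illustration): under Gaussian noise, maximizing $p(D \mid \theta, m)$ for linear regression is equivalent to least squares, and setting the gradient to zero gives the normal equations, which we can solve in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 data points, 3 features
true_w = np.array([1.0, -2.0, 0.5])    # ground-truth weights (for this toy example)
y = X @ true_w + 0.1 * rng.normal(size=100)

# Setting the gradient of the squared error to zero yields the
# normal equations:  (X^T X) w = X^T y
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)  # close to true_w, up to noise
```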

## Regularization

Using the maximum likelihood approach, we run the risk of overfitting if we have too many parameters. We can change the function we are optimizing to penalize complexity:

$$f(\theta) = L_\theta (x, y) + \lambda R(\theta)$$

where $\lambda$ is the regularization coefficient and controls the tradeoff between goodness of fit and function simplicity.

## L1 and L2 regularization

Regularizers penalize complexity in different ways. $L1$ penalizes the sum of the absolute values of the parameters, while $L2$ penalizes the sum of the squared parameter values.

L2 function: $$\min_{w} \frac{1}{n} ||y - w^{T}x||^{2} + \lambda ||w||^{2}_{2}$$

L1 function: $$\min_{w} \frac{1}{n} ||y - w^{T}x||^{2} + \lambda ||w||_{1}$$

These regularizers correspond to Gaussian and Laplacian priors on the weights, respectively. L1 regularization also leads to sparsity, shrinking some weight values exactly to zero. Note that in both cases we are still optimizing for a single point estimate of the weights, rather than maintaining a distribution over them.
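The L2 objective above still has a closed form (a sketch with our own toy data; the function name `ridge` is ours): setting the gradient of $\frac{1}{n}||y - Xw||^2 + \lambda ||w||_2^2$ to zero gives $(X^T X + n\lambda I)w = X^T y$. L1, by contrast, has no closed form and needs an iterative solver.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = X @ np.array([3.0, 0.0, -1.5, 0.0]) + 0.1 * rng.normal(size=200)

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares (normal equations)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, 0.01)
w_large = ridge(X, y, 10.0)
# A larger lambda shrinks every weight towards zero (but not exactly to zero,
# which is what distinguishes L2 shrinkage from L1 sparsity).
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```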

## Bayesian Methods

Bayesian methods provide a coherent framework for reasoning about our beliefs in the face of uncertainty.

$$P(\theta | D) = \frac{P(D|\theta)P(\theta)}{P(D)}$$

where $P(\theta)$ is our prior belief about the state of the world, $\theta$; $P(D|\theta)$ is the probability of observations, $D$, given a particular state, $\theta$; $P(\theta|D)$ is our updated belief about the state of the world, $\theta$, given the observations, $D$; and $P(D)$ is the normalizing constant, the marginal likelihood.
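A minimal numeric illustration of this update (the coin-flip setup is our own example): two hypotheses about a coin, a prior over them, and a posterior after observing 8 heads in 10 flips.

```python
# theta in {fair, biased}; D = a sequence with 8 heads and 2 tails.
prior = {"fair": 0.5, "biased": 0.5}
lik = {
    "fair":   0.5**8 * 0.5**2,   # P(D | fair):   p(heads) = 0.5
    "biased": 0.9**8 * 0.1**2,   # P(D | biased): p(heads) = 0.9 (our assumption)
}
evidence = sum(prior[h] * lik[h] for h in prior)            # P(D)
posterior = {h: prior[h] * lik[h] / evidence for h in prior}  # P(theta | D)
print(posterior)  # belief shifts towards the biased hypothesis
```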

## Marginal Likelihoods

We use marginal likelihoods to evaluate cluster membership. The marginal likelihood is defined as:

$$P(D|m) = \int P(D|\theta, m) P(\theta|m) d\theta$$

and can be interpreted as the probability that all data points in $D$ were generated from the same model with unknown parameters $\theta$.

Marginal likelihood is used to compare cluster models. Assuming equal prior probabilities $p(m_1) = p(m_2)$:

$$p(m_2 | D) = \frac{p(D|m_2)}{p(D|m_1)+p(D|m_2)}$$
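As a worked sketch of this comparison (the two models are our own toy choices): $m_1$ fixes the coin at $\theta = 0.5$, while $m_2$ integrates over an unknown $\theta$ with a uniform prior, for which the integral $\int \theta^h (1-\theta)^t \, d\theta = h!\,t!/(h+t+1)!$ is available in closed form.

```python
from math import factorial

h, t = 9, 1                                   # observed: 9 heads, 1 tail
p_d_m1 = 0.5 ** (h + t)                       # P(D | m1), theta fixed at 0.5
p_d_m2 = factorial(h) * factorial(t) / factorial(h + t + 1)  # P(D | m2)

# Posterior over models with equal priors, as in the formula above:
p_m2 = p_d_m2 / (p_d_m1 + p_d_m2)
print(p_m2)  # the marginal likelihood favours the flexible model here
```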

## Nonparametric Bayesian Models

• How do we know which clustering models to compare?
• Large numbers of model comparisons are costly
• Nonparametric Bayesian methods provide flexible priors for clustering models
  • They allow us to infer the "right" number of clusters for our data
• Parametric models assume that some finite set of parameters, or clusters, captures everything there is to know about the data
  • The complexity of the model is bounded
• Nonparametric models assume that an infinite set of parameters is needed
  • The amount of information captured grows as the data grows
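One canonical flexible prior over clusterings (our illustrative choice; the notes above do not name a specific one) is the Chinese restaurant process: each new point joins an existing cluster with probability proportional to its size, or starts a new cluster with probability proportional to a concentration parameter `alpha`, so the number of clusters is not fixed in advance and grows slowly with the data.

```python
import random

def crp(n_points, alpha, seed=0):
    """Sample cluster sizes from a Chinese restaurant process prior."""
    rng = random.Random(seed)
    counts = []                        # counts[k] = size of cluster k
    for _ in range(n_points):
        weights = counts + [alpha]     # existing clusters, then a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)           # open a new cluster
        else:
            counts[k] += 1
    return counts

sizes = crp(100, alpha=1.0)
print(len(sizes), sizes)  # number of clusters drawn from the prior, and their sizes
```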