# Recap

## Maximum Likelihood

As we acquire more data, we can safely consider more complex hypotheses. The approach to finding parameters that we have considered so far is maximum likelihood: the probability of our data given the model is maximized with respect to the parameters.

$$ \mathop{\arg\max}_{\theta} p(D | \theta, m) $$

This can be done in the same way as for least squares in linear regression -- taking the gradient with respect to $\theta$ and setting it equal to zero.
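To make this concrete, here is a minimal numpy sketch (the toy data and variable names are my own, not from the notes): for linear regression with Gaussian noise, maximizing the likelihood is equivalent to least squares, and the zero-gradient condition gives the normal equations, which `np.linalg.lstsq` solves directly.

```python
import numpy as np

# Maximum likelihood for linear regression with Gaussian noise reduces to
# least squares: setting the gradient of the negative log-likelihood to
# zero gives the normal equations X^T X theta = X^T y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # toy design matrix
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)  # noisy observations

theta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves the normal equations
print(theta_ml)  # close to [2.0, -1.0, 0.5]
```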

## Regularization

Using the maximum likelihood approach, we run the risk of overfitting if we have too many parameters. We can change the function we are optimizing to penalize complexity:

$$ f(\theta) = L_\theta (x, y) + \lambda R(\theta) $$

where $\lambda$ is the *regularization coefficient* and controls the tradeoff between goodness of fit and function simplicity.
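In code, the penalized objective is just the loss plus a weighted penalty. A minimal sketch (the squared-error loss and the names here are illustrative choices of mine):

```python
import numpy as np

# Penalized objective f(theta) = L_theta(x, y) + lambda * R(theta).
def f(theta, X, y, lam, R):
    loss = np.mean((y - X @ theta) ** 2)  # goodness of fit, L_theta(x, y)
    return loss + lam * R(theta)          # complexity penalty, lambda * R(theta)

# Example regularizer: an L2 penalty. lam = 0 recovers plain maximum
# likelihood; larger lam trades goodness of fit for simplicity.
l2 = lambda theta: np.sum(theta ** 2)
```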

## L1 and L2 Regularization

Regularizers penalize complexity in different ways: L1 penalizes the sum of the absolute values of the parameters, while L2 penalizes the sum of the squared parameter values.

L2 function: $$ \min_{w} \frac{1}{n} ||y - w^{T}x||^{2} + \lambda ||w||^{2}_{2} $$

L1 function: $$ \min_{w} \frac{1}{n} ||y - w^{T}x||^{2} + \lambda ||w||_{1} $$

These regularizers correspond to Gaussian (L2) and Laplacian (L1) priors on the weights. L1 regularization also leads to *shrinkage* of the weight values towards zero, driving many of them exactly to zero and yielding sparse solutions. Note that we are still optimizing: regularized estimation gives a single point estimate (the MAP solution under these priors) rather than a full distribution over the weights.
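A quick illustration of the difference between the two penalties, assuming scikit-learn is available (the toy data and penalty strengths are my own choices):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Toy data where only 3 of 10 features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty

print(np.round(ridge.coef_, 2))  # all weights shrunk toward zero, none exactly zero
print(np.round(lasso.coef_, 2))  # irrelevant weights typically driven exactly to zero
```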

## Bayesian Methods

Bayesian methods provide a coherent framework for reasoning about our beliefs in the face of uncertainty.

$$ p(\theta | D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} $$

where $p(\theta)$ is our prior belief about the state of the world, $\theta$; $p(D|\theta)$ is the probability of the observations, $D$, given a particular state, $\theta$; and $p(\theta|D)$ is our updated belief about the state of the world, $\theta$, given the observations, $D$.
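As a concrete example, here is a minimal sketch of a Bayesian update using a conjugate Beta-Bernoulli model (the model and numbers are my illustrative choices; the notes state Bayes' rule only in the abstract):

```python
from scipy.stats import beta

# Prior: theta ~ Beta(a, b); likelihood: coin flips ~ Bernoulli(theta).
a, b = 2, 2                      # prior pseudo-counts
data = [1, 1, 0, 1, 1, 1, 0, 1]  # observed flips, D

# Conjugacy makes the posterior another Beta:
# p(theta | D) = Beta(a + heads, b + tails).
heads, tails = sum(data), len(data) - sum(data)
a_post, b_post = a + heads, b + tails

print(beta.mean(a, b))            # prior mean of theta: 0.5
print(beta.mean(a_post, b_post))  # posterior mean, pulled toward the data
```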

## Marginal Likelihoods

We use marginal likelihoods to evaluate cluster membership. The marginal likelihood is defined as:

$$ p(D|m) = \int p(D|\theta, m)\, p(\theta|m)\, d\theta $$

and can be interpreted as the probability that all data points in $D$ were generated from the same model $m$, with the unknown parameters $\theta$ integrated out.
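For simple models the integral can be estimated by Monte Carlo, averaging the likelihood over samples from the prior. A minimal sketch with a toy Beta-Bernoulli model (my choice for illustration; this particular model also has a closed-form marginal likelihood to check against):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1, 1, 0, 1, 1, 1, 0, 1])  # coin flips, D

def marginal_likelihood(data, a, b, n_samples=100_000):
    """Monte Carlo estimate of p(D|m) = E_{theta ~ p(theta|m)}[p(D|theta, m)]."""
    thetas = rng.beta(a, b, size=n_samples)     # theta ~ p(theta|m)
    heads, tails = data.sum(), len(data) - data.sum()
    liks = thetas**heads * (1 - thetas)**tails  # p(D|theta, m)
    return liks.mean()                          # average over the prior

print(marginal_likelihood(data, a=2, b=2))
```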

The marginal likelihood can then be used to compare cluster models (here assuming the two models are equally probable a priori):

$$ p(m_2 | D) = \frac{p(D|m_2)}{p(D|m_1) + p(D|m_2)} $$
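Continuing the sketch above, the two marginal likelihoods plug directly into this formula (the two models here differ only in their priors on $\theta$, an illustrative choice):

```python
# m1: uniform prior on theta; m2: prior concentrated near theta = 0.5.
pD_m1 = marginal_likelihood(data, a=1, b=1)
pD_m2 = marginal_likelihood(data, a=10, b=10)

# Posterior probability of m2, assuming equal prior probability per model.
p_m2 = pD_m2 / (pD_m1 + pD_m2)
print(p_m2)
```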

## Nonparametric Bayesian Models

- How do we know which clustering models to compare?
- Large numbers of model comparisons are costly.

- Nonparametric Bayesian methods provide flexible priors for clustering models
- Allow us to infer the "right" number of clusters for our data

`Parametric models`

assume that some finite set of parameters, or clusters, captures everything there is to know about the data. The complexity of the model is bounded.

`Nonparametric models`

assume that an infinite set of parameters is needed. The amount of information captured grows as the data grows.
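To illustrate how such a prior lets the number of clusters grow with the data, here is a minimal sketch of a draw from a Chinese restaurant process, one standard nonparametric prior over clusterings (the CRP is my illustrative choice; the notes don't name a specific prior):

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    """Sample cluster assignments for n points from a CRP with concentration alpha."""
    assignments = [0]                      # first point starts cluster 0
    for i in range(1, n):
        counts = np.bincount(assignments)  # points in each existing cluster
        probs = np.append(counts, alpha) / (i + alpha)  # existing clusters vs a new one
        assignments.append(rng.choice(len(probs), p=probs))
    return assignments

rng = np.random.default_rng(0)
for n in [10, 100, 1000]:
    k = len(set(crp_assignments(n, alpha=1.0, rng=rng)))
    print(n, k)  # the number of occupied clusters grows (roughly like log n)
```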