Maximum Likelihood

As we acquire more data, we can safely consider more complex hypotheses. The approach we have considered so far for finding parameters is maximum likelihood: we choose the parameters $\theta$ that maximize the probability of our data given the model.

$$ \mathop{\arg\max}_{\theta} p(D | \theta, m) $$

This can be done the same way as for least squares in linear regression -- taking the gradient with respect to $\theta$ and setting it equal to zero.
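As a minimal sketch of that recipe, assume a linear model with Gaussian noise, so that maximizing the likelihood is the same as minimizing squared error. The function name `fit_mle_linear`, the design matrix `X`, and the synthetic data are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def fit_mle_linear(X, y):
    """Maximum likelihood weights for y = X @ w + Gaussian noise."""
    # Setting the gradient of the log-likelihood to zero gives the
    # normal equations X^T X w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic data (assumed for illustration): bias column plus one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_w = np.array([1.0, 2.0])
y = X @ true_w + 0.1 * rng.normal(size=200)

w_mle = fit_mle_linear(X, y)

# At the maximum likelihood solution the gradient of the squared error vanishes.
grad = X.T @ (X @ w_mle - y)
print("estimated weights:", w_mle)
print("gradient at solution:", grad)
```

Solving the normal equations directly is fine for small problems; for ill-conditioned `X`, a least-squares solver such as `np.linalg.lstsq` would be a more robust choice.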


Using the maximum likelihood approach, we run the risk of overfitting if we have too many parameters. To counter this, we can change the function we are optimizing so that it penalizes complexity.

$$ f(\theta) = L_\theta (x, y) + \lambda R(\theta) $$

where $L_\theta(x, y)$ is the loss measuring goodness of fit, $R(\theta)$ is the regularizer measuring complexity, and $\lambda$ is the regularization coefficient, which controls the tradeoff between goodness of fit and function simplicity.

L1 and L2 regularization

Regularizers penalize complexity in different ways. $L1$ penalizes the sum of the absolute values of the weights, while $L2$ penalizes the sum of the squared weights.

L2 function: $$ \min_{w} \frac{1}{n} \|y - w^{T}x\|^{2} + \lambda \|w\|^{2}_{2} $$

L1 function: $$ \min_{w} \frac{1}{n} \|y - w^{T}x\|^{2} + \lambda \|w\|_{1} $$
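The L2 objective has a closed-form minimizer, which makes the effect of $\lambda$ easy to see. The sketch below assumes the data points are stacked as rows of a matrix `X`, so that $w^{T}x$ for every point becomes `X @ w`. The name `fit_ridge` and the synthetic data are assumptions made for illustration. The L1 objective has no such closed form and is usually solved iteratively, so it is not shown here.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/n) * ||y - X w||^2 + lam * ||w||_2^2 in closed form."""
    n, d = X.shape
    # Zero gradient of the objective: (X^T X + n * lam * I) w = X^T y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.5 * rng.normal(size=50)

# Larger lam puts more weight on the penalty, so the weight norm shrinks.
for lam in (0.0, 0.1, 10.0):
    w = fit_ridge(X, y, lam)
    print(f"lambda={lam:5.1f}  ||w||_2={np.linalg.norm(w):.3f}")
```

With `lam=0.0` this reduces to ordinary least squares. The factor `n` appears because the data-fit term is averaged over the `n` points.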

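The L2 and L1 regularizers correspond to Gaussian and Laplacian priors on the weights, respectively. Both shrink the weight values towards zero, and L1 additionally tends to set some weights exactly to zero, giving sparse solutions. In either case we are still optimizing: the result is a single best setting of the weights.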
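A short derivation sketch of the correspondence between regularizers and priors, where the prior scales $\tau$ and $b$ are symbols introduced here for illustration. Maximizing the posterior over the weights instead of the likelihood gives

$$ \mathop{\arg\max}_{w} \; p(D | w) \, p(w) = \mathop{\arg\min}_{w} \; \left[ -\log p(D | w) - \log p(w) \right] $$

For a Gaussian prior, $p(w) \propto \exp\left(-\|w\|^{2}_{2} / 2\tau^{2}\right)$, the second term is

$$ -\log p(w) = \frac{1}{2\tau^{2}} \|w\|^{2}_{2} + \text{const} $$

and for a Laplacian prior, $p(w) \propto \exp\left(-\|w\|_{1} / b\right)$, it is

$$ -\log p(w) = \frac{1}{b} \|w\|_{1} + \text{const} $$

So the negative log prior plays the role of $\lambda R(w)$: a narrower prior (smaller $\tau$ or $b$) corresponds to a larger regularization coefficient.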

Bayesian Methods

Bayesian methods provide a coherent framework for reasoning about our beliefs in the face of uncertainty.

$$ P(\theta | D) = \frac{P(D|\theta)P(\theta)}{P(D)} $$

where $P(\theta)$ is our prior belief about the state of the world, $\theta$; $P(D|\theta)$ is the probability of the observations, $D$, given a particular state, $\theta$; and $P(\theta|D)$ is our updated belief about the state of the world, $\theta$, given the observations, $D$. The denominator $P(D)$ normalizes the posterior so that it sums (or integrates) to one over $\theta$.
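A minimal numerical sketch of this update, assuming a toy problem that is not in the original notes: $\theta$ is the unknown probability of heads for a coin, restricted to a grid of candidate values, and $D$ is a sequence of 7 heads and 3 tails.

```python
import numpy as np

# Candidate states of the world: a grid of possible heads probabilities.
theta = np.linspace(0.01, 0.99, 99)

# P(theta): a uniform prior over the grid (an assumption for illustration).
prior = np.full(theta.shape, 1.0 / theta.size)

# P(D | theta): probability of a specific sequence with 7 heads and 3 tails.
heads, tails = 7, 3
likelihood = theta**heads * (1.0 - theta) ** tails

# P(D): sum of likelihood * prior over all candidate states.
evidence = np.sum(likelihood * prior)

# P(theta | D): Bayes' rule applied at every grid point.
posterior = likelihood * prior / evidence

theta_map = theta[np.argmax(posterior)]
print("most probable theta on the grid:", theta_map)
print("posterior sums to:", posterior.sum())
```

Because the prior is uniform, the most probable grid value coincides with the maximum likelihood estimate; a non-uniform `prior` would pull it towards the values favored beforehand.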

Marginal Likelihoods

We use marginal likelihoods to evaluate cluster membership. The marginal likelihood is defined as:

$$ P(D|m) = \int P(D|\theta, m) P(\theta|m) d\theta $$

and can be interpreted as the probability that all data points in $D$ were generated from the same model with unknown parameters $\theta$.

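Marginal likelihoods are used to compare cluster models. For two candidate models $m_1$ and $m_2$ with equal prior probability, the posterior probability of $m_2$ is:

$$ P(m_2 | D) = \frac{P(D|m_2)}{P(D|m_1)+P(D|m_2)} $$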
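As a hedged illustration, the sketch below assumes a toy setting in which the integral has a closed form: binary observations with a Beta prior on $\theta$. Model $m_1$ says both groups of observations come from one cluster with a shared $\theta$; model $m_2$ says each group is its own cluster with its own $\theta$. The function name `log_marginal`, the counts, and the Beta(1, 1) prior are all assumptions made for illustration.

```python
from math import exp, lgamma

def log_beta(x, y):
    """Log of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def log_marginal(heads, tails, a=1.0, b=1.0):
    """log P(D | m) for a binary sequence under a Beta(a, b) prior on theta.

    The integral of theta^heads * (1 - theta)^tails against the prior
    equals B(a + heads, b + tails) / B(a, b).
    """
    return log_beta(a + heads, b + tails) - log_beta(a, b)

# Two groups of observations (assumed counts): (heads, tails).
group_a = (9, 1)
group_b = (1, 9)

# m1: one cluster, all points share one theta.
log_m1 = log_marginal(group_a[0] + group_b[0], group_a[1] + group_b[1])

# m2: two clusters, each group has its own independent theta.
log_m2 = log_marginal(*group_a) + log_marginal(*group_b)

# Posterior probability of m2 under equal model priors, computed stably
# from the log marginal likelihoods.
p_m2 = 1.0 / (1.0 + exp(log_m1 - log_m2))
print("P(m2 | D) =", p_m2)
```

Working in log space avoids underflow when the data set is large, since marginal likelihoods of long sequences quickly become very small numbers.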

Nonparametric Bayesian Models

  • How do we know which clustering models to compare?
    • Large numbers of model comparisons are costly
  • Nonparametric Bayesian methods provide flexible priors for clustering models
    • Allow us to infer the "right" number of clusters for our data
  • Parametric models assume that some finite set of parameters, or clusters, captures everything there is to know about the data
    • The complexity of the model is bounded
  • Nonparametric models assume that an infinite set of parameters is needed
    • The amount of information captured grows as the data grows
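
One commonly used prior of this kind is the Chinese restaurant process. The notes do not name a specific prior, so its use here is an assumption for illustration, as are the function name `sample_crp` and the concentration parameter `alpha`. The sketch samples cluster sizes from the prior alone (no data likelihood) to show that the number of clusters is not fixed in advance and tends to grow as more points arrive.

```python
import numpy as np

def sample_crp(n_points, alpha, rng):
    """Sample cluster sizes from a Chinese restaurant process prior."""
    counts = []  # counts[k] = number of points currently in cluster k
    for i in range(n_points):
        # Point i joins existing cluster k with probability counts[k] / (i + alpha)
        # and starts a new cluster with probability alpha / (i + alpha).
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
    return counts

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    sizes = sample_crp(n, alpha=1.0, rng=rng)
    print(f"n={n:5d}  clusters={len(sizes)}")
```

A larger `alpha` makes new clusters more likely. In a full clustering model this prior would be combined with a likelihood for the points in each cluster.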

PPT: Introduction to Machine Learning