Probability and Information Theory
Probability theory is a mathematical framework for representing uncertain statements. In AI we use probability theory in two major ways. First, the laws of probability tells us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. Second , we can use probability and statistic to theoretically analyze the behavior of proposed AI systems.
While probability theory allows us to make uncertain statements and reason in the presence of uncertainty, information allows us to quantify the amount of uncertainty in a probability distribution.
Machine learning must always deal with uncertain quantities, and sometimes may also need to deal with stochastic quantities. There are three possible sources of uncertainty:
- Inherent stochasticity in the system being modeled.
- Incomplete observability
- Incomplete modeling
In many cases, it is more practical to use a simple but uncertain rule rather than a complex but certain one, even if the true rule is deterministic and our modeling system has the fidelity to accommodate a complex rule.
Given that we need a means of representing and reasoning about uncertainty. We use probability to represent a degree of belief, with 1 indicating absolute certainty and 0 indicating absolute impossible certainty. This probability related to qualitative levels of certainty, is known as Bayesian probability.
Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other sets of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.
A random variable is a variable that can take on different values randomly. Random variables may be discrete or continuous.
A * probability distributions* is a description of how likely a random variable or set of random variables is to take on each of its possible states.
Discrete Variable and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF).
The PMF maps from a state of a random variable to the probability of that random variable taking on the state. PMF can act on many variables at the same time. Such a probability distribution over many variables is known as a joint probability distribution.
Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability density function (PDF) rather than a PMF. To be a PDF, a function
p must satisfy the following properties:
- The domain of
pmust be the set of all possible states of x
- ∀ x∈x, p(x)≧0. Note that we do not require p(x) ≦ 1.
- ∫p(x)dx = 1
A PDF does not give the probability of a specific state directly, instead the probability of landing inside an infinitesimal region with volume σx is given by p(x)σx.
The probability distribution over the subset is known as the marginal probability distribution.
We can find P(x) with the sum rule if we have discrete random variables x and y:
For continuous variables, we need to use integration instead of summation:
In many cases, we are interested in the probability of some event, given that some other event has happened. This conditional probability can be computed with the formula:
The Chain Rule of Conditional Probabilities
Any joint probability distribution over many random variables may be decomposed into conditional distributions over only one variable:
This observation is known as the chain rule or product rule of probability.
Independence and Conditional Independence
Two random variables x and y are independent if their probability distribution can be expressed as a product of two factors, one involving only x and one involving only y.
We can denote independence and conditional independence with compact notation:
x⊥y means that x and y are independent, while
x⊥y|z means that x and y are conditionally independent given z.
Expectation, Variance and Covariance
The expectation or expected value of some function f(x) with respect to a probability distribution P(x) is the average or mean value that f takes on when x is drawn from P. For discrete variables this can be computed with a summation:
while for continuous variables, it is computed with an integral:
Expectations are linear.
The variance gives a measure of how much the values of a function of a random variable x vary as we sample different values of x from its probability distribution:
When the variance is low, the values of f(x) cluster near their expected value. The square root of the variance is known as the standard deviation.
The convariance gives some sense of how much two values are linearly related to each other, as well as the scale of these variables:
High absolute values of the covariance mean that the values change very much and are both far from their respective means at the same time. If the sign of the covariance is positive, then both variables tend to take on relatively high values simultaneously. If the sign of the covariance is negative, then one variable tends to take on a relatively high value at the times that the other takes on a relatively low value and vice versa. Other measures such as correlation normalize the contribution of each variable in order to measure only how much the variables are related, rather than also being affected by the scale of the separate variables.
Two variables that are independent have zero cocvariance, and two variables that have non-zero convariance are dependent. For two variables to have zero covarriance, there must beno linear dependence between them. It is possible taht two variables to be dependent but have zero convariance.
The covariance matrix of a random vector x∈Rn is an n×n matrix, such that:
The diagonal elements of the covariance give the variance:
Common Probability Distribution
Several simple probability distributions are useful in many contexts in machine learning.
Bernoulli distribution is a distribution over a single binaryrandom variable. It is controlled by a single parameter Φ∈[0, 1], which gives the probability of the random variable being equal to 1. It has following properties:
The multinoulli or categorical distribution is a distribtion over a single discrete variable with k different states, where k is finite. Multinoulli distributions are often used to refer to distributions over categories of objects.We do not usually need to compute the expectation or variance of multinoulli-distributed random variables.
Bernoulli and Multinoulli distribution are sufficient to describe any distribution over their domain. This is because they model discrete variables for which it is feasible to simply enumerate all of the states. When dealing with continuous variables, there are uncountably many states, so any distribution described by a small number of parameters must impose strict limits on the distribution.
The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
The two parameters μ∈R and σ∈(0, ∞) control the normal distribution. The parameter μ gives the coordinate of the central peak. This is also the mean of distribution
E[x] = μ. The standard deviation of the distribution is given by σ, and the variance by σ².
When you evaluate the PDF, we need to square and invert σ. When we need to frequently evaluate the PDF with different parameter values, a more efficient way of parameterizing the distribution is to use a parameter β∈(0, ∞) to control the precision or inverse variance of the distribution:
Normal distributions are a sensible choice for many applications. The normal distribution is a good default choice for two major reasons.
First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over real numbers. We can thus think of the normal distribution as being the one that insets the least amount of prior knowledge into a model.
The normal distribution generalizes to Rn, in which case it is known as the multivariate normal distribution. It maybe parameterized with a positive definite symmetric matrix Σ:
The parameter μ still gives the mean of he distribution, though now it is vector-valued. The parameter Σ gives the covariance matrix of the distribution. As in the univariate case, when we wish to evaluate the PDF several times for many different values of the parameters, the covariance is not a computationally efficient way to parametrize the distribution, since we need to invert Σ to evaluate the PDF. We can instead use a precision matrix β:
We often fix the covariance matrix to be a diagonal matrix. An even simpler version is the isotropic Gaussian distribution, whose covariance matrix is a scalar times the identity matrix.