TESL Overview of Supervised Learning 1

We have used the more modern language of machine learning. In the statistical literature the inputs are often called the predictors, and more classically the independent variables. In the pattern recognition literature the term features is preferred. The outputs are called the responses, or classically the dependent variables.

Variable Type and Terminology

The distinction in output type has led to a naming convention for the prediction tasks: regression when we predict quantitative outputs, and classification when we predict qualitative outputs.

Inputs also vary in measurement type; we can have a mix of qualitative and quantitative input variables.

A third variable type is ordered categorical, such as small, medium and large, where there is an ordering between the values but no metric notion is appropriate.

We will typically denote an input variable by the symbol X. If X is a vector, its components can be accessed by subscripts Xj. Quantitative outputs will be denoted by Y, and qualitative outputs by G. We use uppercase letters such as X, Y or G when referring to the generic aspects of a variable. Observed values are written in lowercase; hence the ith observed value of X is written as xi.

For the moment we can loosely state the learning task as follows: given the value of an input vector X, make a good prediction of the output Y, denoted by Y-hat. If Y takes values in IR then so should Y-hat; likewise for categorical outputs, G-hat should take values in the same set associated with G.

We need data to construct prediction rules, often a lot of it. We thus suppose we have available a set of measurements (xi, yi) or (xi, gi), i = 1, ..., N, known as the training data, with which to construct our prediction rule.

Two Simple Approaches to Prediction: Least Squares and Nearest Neighbors

In this section we develop two simple but powerful prediction methods: the linear model fit by least squares and the k-nearest-neighbor prediction rule. The linear model makes huge assumptions about structure and yields stable but possibly inaccurate predictions. The method of k-nearest neighbors makes very mild structural assumptions: its predictions are often accurate but can be unstable.

Linear Models and Least Squares

Given a vector of inputs X^T = (X1, X2, ..., Xp), we predict the output Y via the model:

\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j \qquad (2.1)

The term β0-hat is the intercept, also known as the bias in machine learning. Often it is convenient to include the constant variable 1 in X, include β0-hat in the vector of coefficients β-hat, and then write the linear model in vector form as an inner product:

\hat{Y} = X^T \hat{\beta} \qquad (2.2)

where X^T denotes vector or matrix transpose. Here we are modeling a single output, so Y-hat is a scalar; in general Y-hat can be a K-vector, in which case β would be a p×K matrix of coefficients. In the (p+1)-dimensional input-output space, (X, Y-hat) represents a hyperplane. If the constant is included in X, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the Y-axis at the point (0, β0-hat). From now on we assume that the intercept is included in β-hat.

Viewed as a function over the p-dimensional input space, f(X) = X^T β is linear, and the gradient f'(X) = β is a vector in input space that points in the steepest uphill direction.

There are many ways to fit the linear model to a set of training data, but by far the most popular is the method of least squares. In this approach, we pick the coefficients β to minimize the residual sum of squares:

\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 \qquad (2.3)

RSS(β) is a quadratic function of the parameters, and hence its minimum always exists, but may not be unique. In matrix notation it can be written as:

\mathrm{RSS}(\beta) = (y - X\beta)^T (y - X\beta) \qquad (2.4)

where X is an N×p matrix with each row an input vector, and y is an N-vector of the outputs in the training set. Differentiating w.r.t. β we get the normal equations:

X^T (y - X\beta) = 0 \qquad (2.5)

If X^T X is nonsingular, then the unique solution is given by:

\hat{\beta} = (X^T X)^{-1} X^T y \qquad (2.6)

and the fitted value at the ith input xi is yi-hat = y-hat(xi) = xi^T β-hat. At an arbitrary input x0 the prediction is y-hat(x0) = x0^T β-hat. The entire fitted surface is characterized by the p parameters β-hat.
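As a concrete illustration, here is a minimal NumPy sketch of fitting β-hat by solving the normal equations (2.5)-(2.6) and computing fitted values; the function names and toy data are our own, not from the text.

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equations (2.5)-(2.6): beta_hat = (X^T X)^{-1} X^T y.
    X is assumed to already include a leading column of 1s for the intercept;
    np.linalg.solve is used rather than an explicit matrix inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, beta_hat):
    """Fitted values y_hat = X beta_hat, i.e. x^T beta_hat for each row x."""
    return X @ beta_hat

# Toy usage with made-up data: N = 200 observations on p = 2 inputs plus an intercept.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2))
y = rng.normal(size=200)
X = np.column_stack([np.ones(len(X_raw)), X_raw])
beta_hat = fit_least_squares(X, y)
y_hat = predict(X, beta_hat)
```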

[Figure 2.1: scatterplot of the two-class training data in IR^2, with the linear regression decision boundary {x : x^T β-hat = 0.5} and the resulting ORANGE and BLUE regions.]

Figure 2.1 shows a scatterplot of training data on a pair of inputs X1 and X2. The output class variable G has the values BLUE or ORANGE, and is represented as such in the scatterplot. There are 100 points in each of the two classes. The linear regression model was fit to these data, with the response Y coded as 0 for BLUE and 1 for ORANGE. The fitted values Y-hat are converted to a fitted class variable G-hat according to the rule:

\hat{G} = \begin{cases} \text{ORANGE} & \text{if } \hat{Y} > 0.5, \\ \text{BLUE} & \text{if } \hat{Y} \le 0.5 \end{cases} \qquad (2.7)

The set of points in IR^2 classified as ORANGE corresponds to {x : xTβ-hat > 0.5}, indicated in Figure 2.1, and the two predicted classes are separated by the decision boundary {x : xTβ-hat = 0.5}.
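Continuing the sketch above, the coding-and-thresholding rule (2.7) can be written as follows (the 0/1 coding of BLUE/ORANGE follows the text; the helper name is ours):

```python
def classify(X, beta_hat, threshold=0.5):
    """Rule (2.7): predict ORANGE where the fitted value exceeds 0.5, else BLUE."""
    y_hat = X @ beta_hat
    return np.where(y_hat > threshold, "ORANGE", "BLUE")
```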

Consider the two possible scenarios:

  • Scenario 1: The training data in each class were generated from bivariate Gaussian distributions with uncorrelated components and different means.
  • Scenario 2: The training data in each class came from a mixture of 10 low-variance Gaussian distributions, with individual means themselves distributed as Gaussian.

A mixture of Gaussians is best described in terms of the generative model: one first generates a discrete variable that determines which of the component Gaussians to use, and then generates an observation from the chosen density. With tightly clustered components some region of overlap between the classes is inevitable, and future observations falling in that region will be much more difficult to predict.

Nearest-Neighbor Methods

Nearest-neighbor methods use those observations in the training set Τ closest in input space to x to form Y-hat. Specifically, the k-nearest neighbor fit for Y-hat is defined as follows:

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \qquad (2.8)

where Nk(x) is the neighborhood of x defined by the k closest points xi in the training sample. Closeness implies a metric, which for the moment we assume is Euclidean distance. We find the k observations xi closest to x in input space, and average their responses.
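A minimal sketch of the k-nearest-neighbor average (2.8), assuming Euclidean distance and NumPy arrays (the function name is ours):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=15):
    """k-nearest-neighbor fit (2.8): average the responses of the k training
    points closest to x0 in Euclidean distance."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # distance from x0 to every training point
    neighbors = np.argsort(dists)[:k]             # indices of the k closest points
    return y_train[neighbors].mean()
```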

[Figure 2.2: the same training data classified by 15-nearest-neighbor averaging of the binary coded response; the decision boundary is far more irregular than in Figure 2.1.]

In Figure 2.2 we use the same training data as in Figure 2.1, and use 15-nearest-neighbor averaging of the binary coded response as the method of fitting.

Thus Y-hat is the proportion of ORANGE points in the neighborhood, so assigning class ORANGE to G-hat when Y-hat > 0.5 amounts to a majority vote in the neighborhood. The colored regions indicate all those points in input space classified as BLUE or ORANGE by such a rule, in this case found by evaluating the procedure on a fine grid in input space. We see that the decision boundaries that separate the BLUE from the ORANGE regions are far more irregular, and respond to local clusters where one class dominates.
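The colored regions can be reproduced by applying the same 0.5 threshold to k-NN fits evaluated on a grid; a rough sketch continuing the knn_predict function above (the grid resolution is our choice):

```python
def knn_decision_regions(X_train, y_train, k=15, grid_size=100):
    """Evaluate the k-NN majority-vote rule on a fine grid over input space,
    in the spirit of the colored regions in Figure 2.2."""
    x1 = np.linspace(X_train[:, 0].min(), X_train[:, 0].max(), grid_size)
    x2 = np.linspace(X_train[:, 1].min(), X_train[:, 1].max(), grid_size)
    labels = np.empty((grid_size, grid_size), dtype=object)
    for i, a in enumerate(x1):
        for j, b in enumerate(x2):
            y_hat = knn_predict(X_train, y_train, np.array([a, b]), k=k)
            labels[i, j] = "ORANGE" if y_hat > 0.5 else "BLUE"
    return x1, x2, labels
```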

[Figure 2.3: the same data classified by 1-nearest-neighbor; each training point's tile is assigned its class, giving an even more irregular decision boundary.]

Figure 2.3 shows the results for 1-nearest-neighbor classification: Y-hat is assigned the value yl of the closest point xl to x in the training data. Each point xi has an associated tile bounding the region for which it is the closest input point. For all points x in the tile, G-hat(x) = gi. The decision boundary is even more irregular than before.

The method of k-nearest-neighbor averaging is defined in exactly the same way for regression of a quantitative output Y, although k = 1 would be an unlikely choice.

A little thought suggests that for k-nearest-neighbor fits, the error on the training data should be approximately an increasing function of k, and will always be 0 for k = 1.

It appears that k-nearest-neighbor fits have a single parameter, the number of neighbors k, compared to the p parameters in least-squares fits. Although this is the case, we will see that the effective number of parameters of k-nearest neighbors is N/k and is generally bigger than p, and decreases with increasing k. If the neighborhoods were nonoverlapping, there would be N/k neighborhoods and we would fit one parameter in each neighborhood.
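For instance, with the N = 200 training points of Figures 2.1-2.3 and k = 15 neighbors, the effective number of parameters is roughly N/k ≈ 13, compared with about three parameters (two inputs plus an intercept) for the least-squares fit.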

It is also clear that we cannot use sum-of-squared errors on the training set as a criterion for picking k, since we would always pick k = 1. It would seem that the k-nearest-neighbor method would be more appropriate for the mixture Scenario 2 described above, while for Gaussian data the decision boundaries of k-nearest neighbors would be unnecessarily noisy.

From Least Squares to Nearest Neighbors

The linear decision boundary from least squares is very smooth, and apparently stable to fit. It does appear to rely heavily on the assumption that a linear decision boundary is appropriate. It has low variance and potentially high bias.

On the other hand, the k-nearest-neighbor procedures do not appear to rely on any stringent assumptions about the underlying data, and can adapt to any situation.

Each method has its own situations for which it works best; in particular linear regression is more appropriate for Scenario 1 above, while nearest neighbors are more suitable for Scenario 2. The time has come to expose the oracle: first we generated 10 means mk from a bivariate Gaussian distribution N((1, 0)T, I) and labelled this class BLUE. Similarly, 10 more were drawn from N((0, 1)T, I) and labelled class ORANGE. Then for each class we generated 100 observations as follows: for each observation, we picked an mk at random with probability 1/10, and then generated an observation from N(mk, I/5), thus leading to a mixture of Gaussian clusters for each class.
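A minimal NumPy sketch of this generative oracle as just described (the random seed and function name are our own choices):

```python
import numpy as np

def generate_mixture_data(n_per_class=100, seed=0):
    """Simulate the oracle described above: for each class, draw 10 means m_k
    from a bivariate Gaussian, then draw each observation from N(m_k, I/5)."""
    rng = np.random.default_rng(seed)
    X, g = [], []
    for label, center in [("BLUE", np.array([1.0, 0.0])),
                          ("ORANGE", np.array([0.0, 1.0]))]:
        means = rng.multivariate_normal(center, np.eye(2), size=10)   # 10 means m_k
        for _ in range(n_per_class):
            m_k = means[rng.integers(10)]                             # pick an m_k with prob 1/10
            X.append(rng.multivariate_normal(m_k, np.eye(2) / 5.0))   # observation from N(m_k, I/5)
            g.append(label)
    return np.array(X), np.array(g)
```

Feeding the resulting (X, g) to the least-squares and k-NN rules sketched earlier should qualitatively reproduce the comparison summarized in Figure 2.4.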

[Figure 2.4: test performance on 10,000 new observations generated from the model, comparing least squares with k-nearest neighbors over a range of k.]

Figure 2.4 shows the results of classifying 10000 new observations generated from the model. We compare the results for least squares and those for k-nearest neighbors for a range of values of k.