TESL Introduction

TESL--The Elements of Statistical Learning

Learning is called supervised when an outcome variable is present to guide the learning process. In the unsupervised learning problem, we observe only the features and have no measurements of the outcome.

Example 1: Email Spam

The objective was to design an automatic spam detector that could filter out spam before it clogged users' mailboxes. This is a supervised learning problem in which the outcome is the class variable email/spam. It is also called a classification problem.

Average percentage of words in an email

Our learning method has to decide which features to use and how. For example, we might use a rule such as:

  • if (%george < 0.6) & (%you > 1.5) then spam else email
  • if (0.2 * %you - 0.3 * %george) > 0 then spam else email
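
The two rules above can be written out as small functions. This is a minimal sketch in Python, assuming %george and %you are passed in as percentages of words in a message; the function names and the sample values are hypothetical and chosen only for illustration.

```python
def rule_threshold(pct_george: float, pct_you: float) -> str:
    """First rule: spam if %george < 0.6 and %you > 1.5, else email."""
    if pct_george < 0.6 and pct_you > 1.5:
        return "spam"
    return "email"


def rule_linear(pct_george: float, pct_you: float) -> str:
    """Second rule: spam if 0.2 * %you - 0.3 * %george > 0, else email."""
    if 0.2 * pct_you - 0.3 * pct_george > 0:
        return "spam"
    return "email"


# Made-up word percentages, not taken from the spam data set.
label_a = rule_threshold(pct_george=0.0, pct_you=2.0)
label_b = rule_linear(pct_george=1.0, pct_you=1.0)
```

The first rule carves the feature space with two separate thresholds, while the second combines both features into a single linear score. A learning method would have to choose the form of the rule and its constants from the training data.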

Example 2: Prostate Cancer

The study examined the correlation between the level of prostate-specific antigen (PSA) and a number of clinical measures in 97 men who were about to receive a radical prostatectomy. The goal is to predict the log of PSA (lpsa) from a number of measurements: log cancer volume (lcavol), log prostate weight (lweight), age (age), log of benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45).

Scatterplot matrix of the prostate cancer data

This is a supervised learning problem, known as a regression problem, because the outcome measurement is quantitative.
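
To make the idea of predicting a quantitative outcome concrete, here is a minimal sketch in Python of a least-squares line fit with a single predictor. Least squares is just one common choice and is not named in the notes above. The function name fit_line is hypothetical, and the numbers are synthetic points placed exactly on the line y = 1 + 2x; they are not the prostate cancer measurements.

```python
def fit_line(x, y):
    """Fit y ~ intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic training data lying exactly on y = 1 + 2x (illustration only).
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

intercept, slope = fit_line(x, y)

# Quantitative prediction for a new, unseen input.
prediction = intercept + slope * 4.0
```

Unlike the spam rules, the output here is a number rather than a class label, which is what makes the problem a regression problem.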

Example 3: Handwritten Digit Recognition

Each image is a segment from a five-digit ZIP code, isolating a single digit. The images have been normalized to have approximately the same size and orientation. This is a classification problem for which the error rate needs to be kept very low to avoid misdirection of mail.

Example 4: DNA Expression Microarrays

DNA stands for deoxyribonucleic acid, and is the basic material that makes up human chromosomes. Microarrays are considered a breakthrough technology in biology, facilitating the quantitative study of thousands of genes simultaneously from a single sample of cells.