November, 2015
Logit example
Supervised Learning
Next time
Remember this?
Exercise: How did I calculate the line?
supervised learning: Models designed to infer a relationship from labeled training data.
labelled data: For each observation of the predictor variables \(x_i\), \(i = 1, ..., n\), there is an associated response measurement \(y_i\)
Statistical learning models are designed to minimize out of sample error: the error rate you get on a new data set
Key ideas
Mean squared error (MSE):
\[\frac{1}{n} \sum_{i=1}^{n} (\text{prediction}_i - \text{truth}_i)^2\]
Root mean squared error
\[\sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{prediction}_i - \text{truth}_i)^2}\]
Question: what is the difference?
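As a rough sketch with made-up numbers (not from any of the data below), the two measures differ only in scale:
prediction = c(70, 72, 68, 75)   # hypothetical predicted ages at death
truth      = c(71, 70, 69, 80)   # hypothetical observed ages at death
mse  = mean((prediction - truth)^2)   # 7.75, in squared years
rmse = sqrt(mse)                      # about 2.78, back in years
RMSE is on the same scale as the outcome, which makes it easier to interpret; MSE is in squared units.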
library("readr") df = read_csv("https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/05-Regression/data/longevity.csv") head(df)
## Source: local data frame [6 x 2]
##
##   Smokes AgeAtDeath
##    (int)      (int)
## 1      1         75
## 2      1         72
## 3      1         66
## 4      1         74
## 5      1         69
## 6      1         65
library("dplyr") df %>% group_by(Smokes) %>% summarise(mean.death = mean(AgeAtDeath))
## Source: local data frame [2 x 2]
##
##   Smokes mean.death
##    (int)      (dbl)
## 1      0     75.254
## 2      1     70.192
Let's look at MSE for different guesses of age of death
guess.accuracy = data.frame()
for (guess in 63:83){
  prediction.error = mean((df$AgeAtDeath - guess)^2)
  guess.accuracy = rbind(guess.accuracy,
                         data.frame(Guess = guess, Error = prediction.error))
}
head(guess.accuracy)
##   Guess   Error
## 1    63 127.451
## 2    64 109.005
## 3    65  92.559
## 4    66  78.113
## 5    67  65.667
## 6    68  55.221
library("ggplot2") ggplot(guess.accuracy, aes(x = Guess, y = Error)) + geom_point() + geom_line() + theme_minimal()
One simple way to assess model accuracy when you have discrete outcomes (Republican/Democrat, professor/student, etc.) is the mean classification error
\[\text{Ave}(I(y_0 \neq \hat{y}_0))\]
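As a small sketch with hypothetical labels, the mean classification error is just the share of predictions that are wrong:
predicted = c("republican", "democrat", "democrat", "republican")
actual    = c("republican", "republican", "democrat", "republican")
mean(predicted != actual)   # 0.25: one out of four predictions is wrong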
But assessing model accuracy with discrete outcomes is often not straightforward.
You might also want to check out ROC curves
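As a hedged sketch (using the pROC package, which is not otherwise part of these notes), an ROC curve summarises a classifier's accuracy over all possible cut-offs of a predicted probability:
library("pROC")                              # assumption: pROC is installed
y.true = c(0, 0, 1, 1, 1, 0, 1, 0)           # hypothetical binary outcomes
y.prob = c(.1, .4, .8, .7, .3, .2, .9, .6)   # hypothetical predicted probabilities
roc.obj = roc(y.true, y.prob)
auc(roc.obj)    # area under the ROC curve
plot(roc.obj)   # the curve itself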
Accuracy on the training set (resubstitution accuracy) is optimistic
A better estimate comes from an independent set (test set accuracy)
But we can't use the test set when building the model or it becomes part of the training set
So we estimate the test set accuracy with the training set
Remember the bias-variance tradeoff
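As a reminder, the expected test MSE at a point \(x_0\) decomposes into variance, squared bias, and irreducible error:
\[E\big(y_0 - \hat{f}(x_0)\big)^2 = \text{Var}\big(\hat{f}(x_0)\big) + \big[\text{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \text{Var}(\epsilon)\]
Flexible models tend to have low bias but high variance; the simulation below makes this visible in the gap between training and test error.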
We now simulate some data to illustrate the difference between test and training data accuracy
set.seed(1)
x = seq(0, 1, by = .01)
y = sin(2 * pi * x) + rnorm(length(x), 0, .1)
df = data.frame(x = x, y = y)
This is a sine wave with some noise
ggplot(data = df, aes(x = x, y = y)) + geom_line() + theme_minimal()
We now split the data into a test and training data set
n = length(x)
indices = sort(sample(1:n, round(.5 * n)))
training.df = df[indices, ]
test.df = df[-indices, ]
Define an RMSE function
rmse = function(prediction, truth){
  return(sqrt(mean((prediction - truth)^2)))
}
We want to find the best polynomial fit for the data. We do this by looping over a set of polynomial degrees
performance = data.frame()
for (d in 1:12){
  poly.fit = lm(y ~ poly(x, degree = d), data = training.df)
  performance = rbind(performance,
                      data.frame(degree = d, data = "training",
                                 RMSE = rmse(prediction = predict(poly.fit),
                                             truth = training.df$y)))
  performance = rbind(performance,
                      data.frame(degree = d, data = "test",
                                 RMSE = rmse(prediction = predict(poly.fit, newdata = test.df),
                                             truth = test.df$y)))
}
head(performance)
##   degree     data       RMSE
## 1      1 training 0.40622193
## 2      1     test 0.50453141
## 3      2 training 0.40115164
## 4      2     test 0.52051182
## 5      3 training 0.09954393
## 6      3     test 0.13162960
ggplot(performance, aes(x = degree, y = RMSE, colour = data)) + geom_point() + geom_line() + theme_minimal()
Question: what happens here?
for (d in 1:12){
  poly.fit = lm(y ~ poly(x, degree = d), data = training.df)
  training.df = cbind(training.df, predict(poly.fit))
  test.df = cbind(test.df, predict(poly.fit, newdata = test.df))
}
names(training.df) = c("x", "y", paste("degree:", 1:12, sep = " "))
names(test.df) = c("x", "y", paste("degree:", 1:12, sep = " "))
library("tidyr")
training.df = training.df %>% gather(degree, prediction, -x, -y)
test.df = test.df %>% gather(degree, prediction, -x, -y)
p = ggplot(training.df) +
  geom_point(aes(x = x, y = y), alpha = .3) +
  geom_line(aes(x = x, y = prediction), colour = "red") +
  facet_wrap(~ degree) +
  theme_minimal()
p
p = ggplot(test.df) +
  geom_point(aes(x = x, y = y), alpha = .3) +
  geom_line(aes(x = x, y = prediction), colour = "red") +
  facet_wrap(~ degree) +
  theme_minimal()
p
The problem with overfitting comes from our model being too complex
Complexity: models are complex when the number or size of the coefficients is large
One approach: punish the model for doing this
This approach is called regularization
Two popular models build on this approach: Ridge and Lasso
The approach is similar: add a penalty term to the OLS minimization problem to discourage overfitting
\[\sum_{i = 1}^{n}\Big(y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij}\Big)^2 + \text{Penalty}\]
The choice of penalty turns out to be very important
Ridge uses the L2 penalty \(\lambda \sum_{j = 1}^{p} \beta_j^2\); the Lasso uses the L1 penalty \(\lambda \sum_{j = 1}^{p} |\beta_j|\)
L1 regularization gives you sparse estimates (and therefore performs a form of variable selection)
We can examine our L1 and L2 norms as follows
lm.fit = lm(y ~ x, data = df)
l2.norm = sum(coef(lm.fit)^2)
l1.norm = sum(abs(coef(lm.fit)))
print(paste0("l2.norm is ", l2.norm))
## [1] "l2.norm is 4.35231437896382"
print(paste0("l1.norm is ", l1.norm))
## [1] "l1.norm is 2.80299809176307"
Regularization methods are implemented in R in the glmnet package (although it might also be worth checking out the newer caret package)
library("glmnet") args(glmnet)
## function (x, y, family = c("gaussian", "binomial", "poisson",
##     "multinomial", "cox", "mgaussian"), weights, offset = NULL,
##     alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobs <
##         nvars, 0.01, 1e-04), lambda = NULL, standardize = TRUE,
##     intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax *
##         2 + 20, nvars), exclude, penalty.factor = rep(1, nvars),
##     lower.limits = -Inf, upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
##         500, "covariance", "naive"), type.logistic = c("Newton",
##     "modified.Newton"), standardize.response = FALSE, type.multinomial = c("ungrouped",
##     "grouped"))
## NULL
alpha controls the norm: alpha = 1 is the Lasso penalty, alpha = 0 is Ridge
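As a minimal sketch (reusing the sine wave data frame df from above), switching between the two penalties only changes alpha:
library("glmnet")
X = poly(df$x, degree = 10)              # polynomial basis as the predictor matrix
ridge.fit = glmnet(X, df$y, alpha = 0)   # Ridge: L2 penalty
lasso.fit = glmnet(X, df$y, alpha = 1)   # Lasso: L1 penalty (the default)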
Let's fit a Lasso model to our sine wave data
library("glmnet") x = as.matrix(cbind(x,rev(x))) glmnet(x, y)
##
## Call: glmnet(x = x, y = y)
##
##       Df    %Dev   Lambda
##  [1,]  0 0.00000 0.542800
##  [2,]  1 0.09991 0.494600
##  [3,]  1 0.18290 0.450700
##  [4,]  1 0.25170 0.410600
##  [5,]  1 0.30890 0.374200
##  [6,]  1 0.35640 0.340900
##  [7,]  1 0.39580 0.310600
##  [8,]  1 0.42850 0.283000
##  [9,]  1 0.45570 0.257900
## [10,]  1 0.47820 0.235000
## [11,]  1 0.49690 0.214100
## [12,]  1 0.51250 0.195100
## [13,]  1 0.52540 0.177800
## [14,]  1 0.53610 0.162000
## [15,]  1 0.54500 0.147600
## [16,]  1 0.55240 0.134500
## [17,]  1 0.55850 0.122500
## [18,]  1 0.56360 0.111600
## [19,]  1 0.56780 0.101700
## [20,]  1 0.57130 0.092680
## [21,]  1 0.57420 0.084450
## [22,]  1 0.57670 0.076940
## [23,]  1 0.57870 0.070110
## [24,]  1 0.58030 0.063880
## [25,]  1 0.58170 0.058210
## [26,]  1 0.58290 0.053030
## [27,]  1 0.58380 0.048320
## [28,]  1 0.58460 0.044030
## [29,]  1 0.58530 0.040120
## [30,]  1 0.58580 0.036550
## [31,]  1 0.58630 0.033310
## [32,]  1 0.58660 0.030350
## [33,]  1 0.58700 0.027650
## [34,]  1 0.58720 0.025200
## [35,]  1 0.58740 0.022960
## [36,]  1 0.58760 0.020920
## [37,]  1 0.58780 0.019060
## [38,]  1 0.58790 0.017370
## [39,]  1 0.58800 0.015820
## [40,]  1 0.58810 0.014420
## [41,]  1 0.58810 0.013140
## [42,]  1 0.58820 0.011970
## [43,]  1 0.58830 0.010910
## [44,]  1 0.58830 0.009938
## [45,]  1 0.58830 0.009055
## [46,]  1 0.58840 0.008251
## [47,]  1 0.58840 0.007518
## [48,]  1 0.58840 0.006850
## [49,]  1 0.58840 0.006241
## [50,]  1 0.58840 0.005687
## [51,]  1 0.58840 0.005182
## [52,]  1 0.58840 0.004721
## [53,]  1 0.58850 0.004302
## [54,]  1 0.58850 0.003920
## [55,]  1 0.58850 0.003571
glmnet output
The output contains three columns
Df: tells you how many coefficients in the model ended up being nonzero
%Dev: essentially an R2 for the model
Lambda: the penalty parameter
Because Lambda controls the values that we get from the model, it is often referred to as a hyperparameter
Large Lambda: heavy penalty for model complexity
Lambda should be picked by the data! That is, we want the Lambda that minimizes RMSE in our test data
library('glmnet')
training.df = df[indices, ]
test.df = df[-indices, ]
glmnet.fit = glmnet(poly(training.df$x, degree = 10), training.df$y)
lambdas = glmnet.fit$lambda
performance = data.frame()
for (lambda in lambdas){
  performance = rbind(performance,
                      data.frame(lambda = lambda,
                                 RMSE = rmse(prediction = predict(glmnet.fit,
                                                                  poly(test.df$x, degree = 10),
                                                                  s = lambda),
                                             truth = test.df$y)))
}
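Plotting RMSE against Lambda (a sketch in the style of the earlier plots) shows where the test error bottoms out:
ggplot(performance, aes(x = lambda, y = RMSE)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  theme_minimal()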
A lambda of 0.05 seems about right…
best.lambda = performance$lambda[performance$RMSE == min(performance$RMSE)]
glmnet.fit = glmnet(poly(df$x, degree = 10), df$y)
coef(glmnet.fit, s = best.lambda)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)  0.0101667
## 1           -5.2132586
## 2            .
## 3            4.1759498
## 4            .
## 5           -0.4643476
## 6            .
## 7            .
## 8            .
## 9            .
## 10           .
Note that the model ends up using only 3 nonzero coefficients even though the model has the ability to use 10
Selecting a simpler model when more complicated models are possible is exactly the point of regularization
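As a quick check (a sketch reusing the objects defined just above), we can plot the regularized fit against the raw data to see that it recovers the sine wave:
fitted.lasso = as.vector(predict(glmnet.fit, poly(df$x, degree = 10), s = best.lambda))
ggplot(data.frame(x = df$x, y = df$y, fit = fitted.lasso), aes(x = x)) +
  geom_point(aes(y = y), alpha = .3) +
  geom_line(aes(y = fit), colour = "red") +
  theme_minimal()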
So far we have just randomly divided our data into a test and a training set
This approach has two drawbacks: the error estimate can be highly variable, depending on which observations end up in the test set, and because only part of the data is used to fit the model, the test error tends to be overestimated
One very useful refinement of the test-training data approach is cross-validation
The idea behind \(k\)-fold Cross Validation is relatively simple: randomly divide the observations into \(k\) folds; for each fold, fit the model on the other \(k - 1\) folds and compute the MSE on the held-out fold
The \(k\) fold CV estimate is computed by averaging the mean squared errors (\(\text{MSE}_1, ..., \text{MSE}_k\))
\[\text{CV}_k = \frac{1}{k}\sum_{i = 1}^{k} \text{MSE}_i\]
Common choices for \(k\) are 10 and 5.
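As a sketch of what the formula does in practice (here for the degree-3 polynomial model on the sine wave data, with k = 5):
set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(df)))   # assign every observation to a fold
fold.mse = numeric(k)
for (i in 1:k){
  cv.fit = lm(y ~ poly(x, degree = 3), data = df[folds != i, ])
  pred = predict(cv.fit, newdata = df[folds == i, ])
  fold.mse[i] = mean((df$y[folds == i] - pred)^2)
}
mean(fold.mse)   # the k-fold CV estimate of the test MSE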
CV can (and should) be used both to find tuning parameters and to report goodness-of-fit measures.
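For the Lasso above, glmnet ships with a helper that runs the k-fold CV for you; as a sketch (reusing df from the sine wave example):
library("glmnet")
set.seed(1)
cv.lasso = cv.glmnet(poly(df$x, degree = 10), df$y, nfolds = 10)
cv.lasso$lambda.min                # the Lambda with the lowest CV error
coef(cv.lasso, s = "lambda.min")   # coefficients at that Lambda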