November, 2015
Logit example
Supervised Learning
Next time
Remember this?
Exercise: How did I calculate the line?
supervised learning: Models designed to infer a relationship from labeled training data.
labelled data: For each observation of the predictor variables \(x_i\), \(i = 1, ..., n\), there is an associated response measurement \(y_i\)
Statistical learning models are designed to minimize out of sample error: the error rate you get on a new data set
Key ideas
Mean squared error (MSE):
\[\frac{1}{n} \sum_{i=1}^{n} (\text{prediction}_i - \text{truth}_i)^2\]
Root mean squared error
\[\sqrt{\frac{1}{n} \sum_{i=1}^{n} (\text{prediction}_i - \text{truth}_i)^2}\]
Question: what is the difference?
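As a rough sketch with made-up numbers (not from any of the data below), the two measures differ only in scale:
prediction = c(70, 72, 68, 75)   # hypothetical predicted ages at death
truth      = c(71, 70, 69, 80)   # hypothetical observed ages at death
mse  = mean((prediction - truth)^2)   # 7.75, in squared years
rmse = sqrt(mse)                      # about 2.78, back in years
RMSE is on the same scale as the outcome, which makes it easier to interpret; MSE is in squared units.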
library("readr") df = read_csv("https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/05-Regression/data/longevity.csv") head(df)
## Source: local data frame [6 x 2]
##
##   Smokes AgeAtDeath
##    (int)      (int)
## 1      1         75
## 2      1         72
## 3      1         66
## 4      1         74
## 5      1         69
## 6      1         65
library("dplyr") df %>% group_by(Smokes) %>% summarise(mean.death = mean(AgeAtDeath))
## Source: local data frame [2 x 2]
##
##   Smokes mean.death
##    (int)      (dbl)
## 1      0     75.254
## 2      1     70.192
Let's look at MSE for different guesses of age of death
guess.accuracy = data.frame()
for (guess in 63:83){
  prediction.error = mean((df$AgeAtDeath - guess)^2)
  guess.accuracy = rbind(guess.accuracy,
                         data.frame(Guess = guess, Error = prediction.error))
}
head(guess.accuracy)
##   Guess   Error
## 1    63 127.451
## 2    64 109.005
## 3    65  92.559
## 4    66  78.113
## 5    67  65.667
## 6    68  55.221
library("ggplot2") ggplot(guess.accuracy, aes(x = Guess, y = Error)) + geom_point() + geom_line() + theme_minimal()
One simple way to assess model accuracy when you have discrete outcomes (Republican/Democrat, professor/student, etc.) is the mean classification error
\[\text{Ave}(I(y_0 \neq \hat{y}_0))\]
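As a small sketch with hypothetical labels, the mean classification error is just the share of predictions that are wrong:
predicted = c("republican", "democrat", "democrat", "republican")
actual    = c("republican", "republican", "democrat", "republican")
mean(predicted != actual)   # 0.25: one out of four predictions is wrong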
But assessing model accuracy with discrete outcomes is often not straightforward.
You might also want to check out ROC curves
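As a hedged sketch (using the pROC package, which is not otherwise part of these notes), an ROC curve summarises a classifier's accuracy over all possible cut-offs of a predicted probability:
library("pROC")                              # assumption: pROC is installed
y.true = c(0, 0, 1, 1, 1, 0, 1, 0)           # hypothetical binary outcomes
y.prob = c(.1, .4, .8, .7, .3, .2, .9, .6)   # hypothetical predicted probabilities
roc.obj = roc(y.true, y.prob)
auc(roc.obj)    # area under the ROC curve
plot(roc.obj)   # the curve itself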
Accuracy on the training set (resubstitution accuracy) is optimistic
A better estimate comes from an independent set (test set accuracy)
But we can't use the test set when building the model or it becomes part of the training set
So we estimate the test set accuracy with the training set
Remember the bias-variance tradeoff
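As a reminder, the expected test MSE at a point \(x_0\) decomposes into variance, squared bias, and irreducible error:
\[E\big(y_0 - \hat{f}(x_0)\big)^2 = \text{Var}\big(\hat{f}(x_0)\big) + \big[\text{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \text{Var}(\epsilon)\]
Flexible models tend to have low bias but high variance; the simulation below makes this visible in the gap between training and test error.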
We now simulate some data to illustrate the difference between test and training data accuracy
set.seed(1)
x = seq(0, 1, by = .01)
y = sin(2 * pi * x) + rnorm(length(x), 0, .1)
df = data.frame(x = x, y = y)
This is a sine wave with some noise
ggplot(data = df, aes(x = x, y = y)) + geom_line() + theme_minimal()
We now split the data into a test and training data set
n = length(x)
indices = sort(sample(1:n, round(.5 * n)))
training.df = df[indices, ]
test.df = df[-indices, ]
Define an RMSE function
rmse = function(prediction, truth){
  return(sqrt(mean((prediction - truth)^2)))
}
We want to find the best polynomial fit for the data. We do this by looping over a set of polynomial degrees
performance = data.frame()
for (d in 1:12){
  poly.fit = lm(y ~ poly(x, degree = d), data = training.df)
  performance = rbind(performance,
                      data.frame(degree = d, data = "training",
                                 RMSE = rmse(prediction = predict(poly.fit),
                                             truth = training.df$y)))
  performance = rbind(performance,
                      data.frame(degree = d, data = "test",
                                 RMSE = rmse(prediction = predict(poly.fit, newdata = test.df),
                                             truth = test.df$y)))
}
head(performance)
##   degree     data       RMSE
## 1      1 training 0.40622193
## 2      1     test 0.50453141
## 3      2 training 0.40115164
## 4      2     test 0.52051182
## 5      3 training 0.09954393
## 6      3     test 0.13162960
ggplot(performance, aes(x = degree, y = RMSE, colour = data)) + geom_point() + geom_line() + theme_minimal()
Question: what happens here?
for (d in 1:12){
  poly.fit = lm(y ~ poly(x, degree = d), data = training.df)
  training.df = cbind(training.df, predict(poly.fit))
  test.df = cbind(test.df, predict(poly.fit, newdata = test.df))
}
names(training.df) = c("x", "y", paste("degree:", 1:12, sep = " "))
names(test.df) = c("x", "y", paste("degree:", 1:12, sep = " "))
library("tidyr")
training.df = training.df %>% gather(degree, prediction, -x, -y)
test.df = test.df %>% gather(degree, prediction, -x, -y)
p = ggplot(training.df) +
  geom_point(aes(x = x, y = y), alpha = .3) +
  geom_line(aes(x = x, y = prediction), colour = "red") +
  facet_wrap(~ degree) +
  theme_minimal()
p
p = ggplot(test.df) +
  geom_point(aes(x = x, y = y), alpha = .3) +
  geom_line(aes(x = x, y = prediction), colour = "red") +
  facet_wrap(~ degree) +
  theme_minimal()
p
The problem with overfitting comes from our model being too complex
Complexity: models are complex when the number or size of the coefficients is large
One approach: punish the model for doing this
This approach is called regularization
Two popular models build on this approach: Ridge and Lasso
The approach is similar: add a penalty term to the OLS minimization problem to discourage overfitting
\[\sum_{i = 1}^{n}\Big(y_i - \beta_0 - \sum_{j = 1}^{p} \beta_j x_{ij}\Big)^2 + \text{Penalty}\]
The choice of penalty turns out to be very important
Ridge uses the L2 penalty \(\lambda \sum_{j = 1}^{p} \beta_j^2\); the Lasso uses the L1 penalty \(\lambda \sum_{j = 1}^{p} |\beta_j|\)
L1 regularization gives you sparse estimates (and therefore performs a form of variable selection)
We can examine our L1 and L2 norms as follows
lm.fit = lm(y ~ x, data = df)
l2.norm = sum(coef(lm.fit)^2)
l1.norm = sum(abs(coef(lm.fit)))
print(paste0("l2.norm is ", l2.norm))
## [1] "l2.norm is 4.35231437896382"
print(paste0("l1.norm is ", l1.norm))
## [1] "l1.norm is 2.80299809176307"
Regularization methods are implemented in R in the glmnet package (although it might also be worth checking out the newer caret package)
library("glmnet") args(glmnet)
## function (x, y, family = c("gaussian", "binomial", "poisson",
##     "multinomial", "cox", "mgaussian"), weights, offset = NULL,
##     alpha = 1, nlambda = 100, lambda.min.ratio = ifelse(nobs <
##         nvars, 0.01, 1e-04), lambda = NULL, standardize = TRUE,
##     intercept = TRUE, thresh = 1e-07, dfmax = nvars + 1, pmax = min(dfmax *
##         2 + 20, nvars), exclude, penalty.factor = rep(1, nvars),
##     lower.limits = -Inf, upper.limits = Inf, maxit = 1e+05, type.gaussian = ifelse(nvars <
##         500, "covariance", "naive"), type.logistic = c("Newton",
##     "modified.Newton"), standardize.response = FALSE, type.multinomial = c("ungrouped",
##     "grouped"))
## NULL
alpha controls the norm: alpha = 1 is the Lasso penalty, alpha = 0 is Ridge
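As a minimal sketch (reusing the sine wave data frame df from above), switching between the two penalties only changes alpha:
library("glmnet")
X = poly(df$x, degree = 10)              # polynomial basis as the predictor matrix
ridge.fit = glmnet(X, df$y, alpha = 0)   # Ridge: L2 penalty
lasso.fit = glmnet(X, df$y, alpha = 1)   # Lasso: L1 penalty (the default)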
Let's fit a Lasso model to our sine wave data
library("glmnet") x = as.matrix(cbind(x,rev(x))) glmnet(x, y)
##
## Call: glmnet(x = x, y = y)
##
##       Df    %Dev   Lambda
##  [1,]  0 0.00000 0.542800
##  [2,]  1 0.09991 0.494600
##  [3,]  1 0.18290 0.450700
##  [4,]  1 0.25170 0.410600
##  [5,]  1 0.30890 0.374200
##  [6,]  1 0.35640 0.340900
##  [7,]  1 0.39580 0.310600
##  [8,]  1 0.42850 0.283000
##  [9,]  1 0.45570 0.257900
## [10,]  1 0.47820 0.235000
## [11,]  1 0.49690 0.214100
## [12,]  1 0.51250 0.195100
## [13,]  1 0.52540 0.177800
## [14,]  1 0.53610 0.162000
## [15,]  1 0.54500 0.147600
## [16,]  1 0.55240 0.134500
## [17,]  1 0.55850 0.122500
## [18,]  1 0.56360 0.111600
## [19,]  1 0.56780 0.101700
## [20,]  1 0.57130 0.092680
## [21,]  1 0.57420 0.084450
## [22,]  1 0.57670 0.076940
## [23,]  1 0.57870 0.070110
## [24,]  1 0.58030 0.063880
## [25,]  1 0.58170 0.058210
## [26,]  1 0.58290 0.053030
## [27,]  1 0.58380 0.048320
## [28,]  1 0.58460 0.044030
## [29,]  1 0.58530 0.040120
## [30,]  1 0.58580 0.036550
## [31,]  1 0.58630 0.033310
## [32,]  1 0.58660 0.030350
## [33,]  1 0.58700 0.027650
## [34,]  1 0.58720 0.025200
## [35,]  1 0.58740 0.022960
## [36,]  1 0.58760 0.020920
## [37,]  1 0.58780 0.019060
## [38,]  1 0.58790 0.017370
## [39,]  1 0.58800 0.015820
## [40,]  1 0.58810 0.014420
## [41,]  1 0.58810 0.013140
## [42,]  1 0.58820 0.011970
## [43,]  1 0.58830 0.010910
## [44,]  1 0.58830 0.009938
## [45,]  1 0.58830 0.009055
## [46,]  1 0.58840 0.008251
## [47,]  1 0.58840 0.007518
## [48,]  1 0.58840 0.006850
## [49,]  1 0.58840 0.006241
## [50,]  1 0.58840 0.005687
## [51,]  1 0.58840 0.005182
## [52,]  1 0.58840 0.004721
## [53,]  1 0.58850 0.004302
## [54,]  1 0.58850 0.003920
## [55,]  1 0.58850 0.003571
glmnet output
The output contains three columns
Df: tells you how many coefficients in the model ended up being nonzero
%Dev: essentially an R2 for the model
Lambda: the penalty parameter
Because Lambda controls the values that we get from the model, it is often referred to as a hyperparameter
Large Lambda: heavy penalty for model complexity
Lambda should be picked by the data! That is, we want the Lambda that minimizes RMSE in our test data
library('glmnet')
training.df = df[indices, ]
test.df = df[-indices, ]
glmnet.fit = glmnet(poly(training.df$x, degree = 10), training.df$y)
lambdas = glmnet.fit$lambda
performance = data.frame()
for (lambda in lambdas){
  performance = rbind(performance,
                      data.frame(lambda = lambda,
                                 RMSE = rmse(prediction = predict(glmnet.fit,
                                                                  poly(test.df$x, degree = 10),
                                                                  s = lambda),
                                             truth = test.df$y)))
}
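Plotting RMSE against Lambda (a sketch in the style of the earlier plots) shows where the test error bottoms out:
ggplot(performance, aes(x = lambda, y = RMSE)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +
  theme_minimal()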
A lambda of 0.05 seems about right…
best.lambda = performance$lambda[performance$RMSE == min(performance$RMSE)]
glmnet.fit = glmnet(poly(df$x, degree = 10), df$y)
coef(glmnet.fit, s = best.lambda)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)  0.0101667
## 1           -5.2132586
## 2            .
## 3            4.1759498
## 4            .
## 5           -0.4643476
## 6            .
## 7            .
## 8            .
## 9            .
## 10           .
Note that the model ends up using only 3 nonzero coefficients even though the model has the ability to use 10
Selecting a simpler model when more complicated models are possible is exactly the point of regularization
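As a quick check (a sketch reusing the objects defined just above), we can plot the regularized fit against the raw data to see that it recovers the sine wave:
fitted.lasso = as.vector(predict(glmnet.fit, poly(df$x, degree = 10), s = best.lambda))
ggplot(data.frame(x = df$x, y = df$y, fit = fitted.lasso), aes(x = x)) +
  geom_point(aes(y = y), alpha = .3) +
  geom_line(aes(y = fit), colour = "red") +
  theme_minimal()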
So far we have just randomly divided our data into a test and a training set
This approach has two drawbacks: the error estimate can be highly variable, depending on which observations end up in the test set, and because only part of the data is used to fit the model, the test error tends to be overestimated
One very useful refinement of the test-training data approach is cross-validation
The idea behind \(k\)-fold Cross Validation is relatively simple: randomly divide the observations into \(k\) folds; for each fold, fit the model on the other \(k - 1\) folds and compute the MSE on the held-out fold
The \(k\) fold CV estimate is computed by averaging the mean squared errors (\(\text{MSE}_1, ..., \text{MSE}_k\))
\[\text{CV}_k = \frac{1}{k}\sum_{i = 1}^{k} \text{MSE}_i\]
Common choices for \(k\) are 10 and 5.
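As a sketch of what the formula does in practice (here for the degree-3 polynomial model on the sine wave data, with k = 5):
set.seed(1)
k = 5
folds = sample(rep(1:k, length.out = nrow(df)))   # assign every observation to a fold
fold.mse = numeric(k)
for (i in 1:k){
  cv.fit = lm(y ~ poly(x, degree = 3), data = df[folds != i, ])
  pred = predict(cv.fit, newdata = df[folds == i, ])
  fold.mse[i] = mean((df$y[folds == i] - pred)^2)
}
mean(fold.mse)   # the k-fold CV estimate of the test MSE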
CV can (and should) be used both to find tuning parameters and to report goodness-of-fit measures.
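For the Lasso above, glmnet ships with a helper that runs the k-fold CV for you; as a sketch (reusing df from the sine wave example):
library("glmnet")
set.seed(1)
cv.lasso = cv.glmnet(poly(df$x, degree = 10), df$y, nfolds = 10)
cv.lasso$lambda.min                # the Lambda with the lowest CV error
coef(cv.lasso, s = "lambda.min")   # coefficients at that Lambda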