Assignment 2
Guest lecture: Zoltan Fazekas
Statistical learning
November, 2015
What is the objective of empirical policy research?
Today:
Most econometric theory is focused on estimating causal effects
Causal effect: what is the effect of some policy on an outcome we are interested in?
Examples of causal questions: What is the effect of a job-training program on earnings? Of class size on test scores?
Variable of interest (often called treatment): \(D_i\)
Outcome of interest: \(Y_i\)
Potential outcome framework
\[ Y_i = \left\{ \begin{array}{rl} Y_{1i} & \text{if } D_i = 1,\\ Y_{0i} & \text{if } D_i = 0 \end{array} \right. \]
The observed outcome \(Y_i\) can be written in terms of potential outcomes as
\[ Y_i = Y_{0i} + (Y_{1i}-Y_{0i})D_i\]
\(Y_{1i}-Y_{0i}\) is the causal effect of \(D_i\) on \(Y_i\).
But we never observe the same individual \(i\) in both states (treated & non-treated).
This is the fundamental problem of causal inference.
We need some way of estimating the state we do not observe (the counterfactual)
Usually, our sample contains individuals from both states
So why not do a naive comparison of averages by treatment status?
\[E[Y_i|D_i = 1] - E[Y_i|D_i = 0] = \\ E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] + \\ E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]\]
\(E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] = E[Y_{1i} - Y_{0i}|D_i = 1]\): the average causal effect of \(D_i\) on \(Y_i\) for the treated.
\(E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]\): difference in average \(Y_{0i}\) between the two groups. Likely to be different from 0 when individuals are allowed to self-select into treatment. Often referred to as selection bias.
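To make the decomposition concrete, here is a minimal simulation sketch (the numbers and the self-selection rule are made up for illustration): individuals with high unobserved ability opt into treatment, so the naive comparison of means overstates the true effect of 2.

```r
# Illustrative simulation: self-selection creates selection bias
set.seed(1)
n       <- 10000
ability <- rnorm(n)                            # unobserved
y0      <- 10 + ability + rnorm(n)             # potential outcome without treatment
y1      <- y0 + 2                              # true causal effect is 2 for everyone
d       <- as.numeric(ability + rnorm(n) > 0)  # high-ability individuals self-select
y       <- y0 + (y1 - y0) * d                  # observed outcome: one state per person

mean(y[d == 1]) - mean(y[d == 0])    # naive comparison: well above 2
mean(y0[d == 1]) - mean(y0[d == 0])  # the selection bias term (> 0)
```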
Random assignment of \(D_i\) solves the problem because random assignment makes \(D_i\) independent of potential outcomes
That means that \(E[Y_{0i}|D_i = 1] = E[Y_{0i}|D_i = 0]\) and thus that the selection bias term is zero
Intuition: with random assignment, non-treated individuals can be used as counterfactuals for the treated (what would have happened to individual \(i\) had they not received the treatment?)
This allows us to overcome the fundamental problem of causal inference
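Continuing the illustrative simulation from above: replacing self-selection with a coin flip makes the selection bias term vanish.

```r
# Same simulated population, but treatment assigned at random
d_rand <- rbinom(n, 1, 0.5)
y_rand <- y0 + (y1 - y0) * d_rand

mean(y_rand[d_rand == 1]) - mean(y_rand[d_rand == 0])  # close to the true effect, 2
mean(y0[d_rand == 1]) - mean(y0[d_rand == 0])          # selection bias ~ 0
```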
“no causation without manipulation”
Paul Holland (1986)
As mentioned, we need to worry when individuals are allowed to self-select
This means that a lot of thought has to go into the randomization phase
Assignment into treatment groups has to be manipulated by someone (or something)
Quasi-experiments: randomization happens by "accident"
Randomized controlled trials: randomization usually done by researcher
Note: difficult to say one is strictly better than the other. Randomization can be impractical and/or unethical.
Can you come up with an example where randomization would be unethical?
Causal questions are of key interest to policy makers and academics
The key focus is on inference: we want to know about the causal effect of \(D\) on \(Y\) in the population of interest
When you are interested in a causal question you need to think carefully about randomization of treatment (this is often referred to as your identification strategy)
Is causality the only thing policy makers and social scientists should be interested in?
“It's tough to make predictions, especially about the future.”
Storm P
Many policy problems are not about causality but rather about prediction
Standard empirical techniques are not optimized for prediction problems because they focus on unbiasedness
Remember the Gauss-Markov theorem from Econometrics B?
Gauss-Markov Theorem: In a linear regression model where the errors have mean zero (\(E(u_i) = 0\), no omitted variable bias), constant variance (\(V(u_i) = \sigma^2\), homoskedasticity), and are uncorrelated, the OLS estimator is BLUE (Best Linear Unbiased Estimator).
Keywords: unbiased (\(E(\hat{\beta}) = \beta\)) and best (smallest variance among the class of all linear unbiased estimators)
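A quick Monte Carlo sketch (simulated data, illustrative settings) of what unbiasedness means: across repeated samples, the OLS estimates center on the true coefficient.

```r
# E(beta_hat) = beta: OLS slope estimates average out to the truth
set.seed(1)
beta_hat <- replicate(2000, {
  x <- rnorm(100)
  y <- 1 + 2 * x + rnorm(100)  # true slope is 2
  coef(lm(y ~ x))[2]
})
mean(beta_hat)  # ~ 2: unbiased
sd(beta_hat)    # the estimator's sampling variation
```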
But what about biased estimators?
OLS is designed to minimize in-sample error: the error rate you get on the same data set you used to build your predictor.
\[ \hat{\beta} = \text{arg min}_{\beta} \sum_{i = 1}^{n} (y_i - x_i'\beta)^2 \]
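The objective can be made explicit in code: numerically minimizing the sum of squared residuals (a sketch on simulated data) reproduces the lm() coefficients.

```r
# OLS as an explicit in-sample minimization problem
set.seed(1)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)

sse <- function(b) sum((y - b[1] - b[2] * x)^2)  # in-sample squared error
optim(c(0, 0), sse)$par  # numeric arg min: ~ (1, 2)
coef(lm(y ~ x))          # same answer from lm()
```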
But for prediction we are interested in minimizing out-of-sample error: the error rate you get on a new data set
To see this, consider a prediction at a new point, \(x_0\). Our prediction for \(y_0\) is then \(\hat{f}(x_0)\) and the mean squared error (MSE) can be decomposed as
\[ E[(y_0 - \hat{f}(x_0))^2] = [\text{Bias}(\hat{f}(x_0))]^2 + V(\hat{f}(x_0)) + \sigma^2\]
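This uses \(y_0 = f(x_0) + \varepsilon\) with \(E[\varepsilon] = 0\), \(V(\varepsilon) = \sigma^2\), and \(\varepsilon\) independent of \(\hat{f}(x_0)\), so the cross terms vanish:
\[ E[(y_0 - \hat{f}(x_0))^2] = \underbrace{\left(E[\hat{f}(x_0)] - f(x_0)\right)^2}_{[\text{Bias}(\hat{f}(x_0))]^2} + \underbrace{E\left[\left(\hat{f}(x_0) - E[\hat{f}(x_0)]\right)^2\right]}_{V(\hat{f}(x_0))} + \underbrace{E[\varepsilon^2]}_{\sigma^2} \]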
By insisting on zero bias, OLS picks a corner solution in this trade-off. This is generally not optimal for prediction.
What do we mean by the variance and bias of an estimator?
\(\text{Bias}(\hat{f}(x_0)) = E[\hat{f}(x_0) - f(x_0)]\): the error introduced by approximating a complicated real-life problem with a model that is too simple. A high-bias model underfits: it misses systematic structure and predicts poorly even on new data.
\(V(\hat{f}(x_0))\): how much \(\hat{f}(x_0)\) would change if we re-estimated it on a different training set. A high-variance model is too complex: small changes in the data cause the fitted model to change a lot.
Machine learning techniques were developed specifically to maximize prediction performance by providing an empirical way to make this bias-variance trade-off
But generally, that means that all our models are somewhat biased (making inference impossible)
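As a sketch of this trade-off (simulated data, arbitrary penalty value): a ridge estimator, which is biased but lower-variance, beats OLS on out-of-sample MSE when there are many noisy predictors.

```r
# Biased-but-stable beats unbiased-but-noisy: OLS vs. ridge out of sample
set.seed(1)
n <- 60; p <- 40
X     <- matrix(rnorm(n * p), n, p)
b     <- rnorm(p, sd = 0.3)
y     <- X %*% b + rnorm(n, sd = 2)
X_new <- matrix(rnorm(1000 * p), 1000, p)  # fresh data for evaluation
y_new <- X_new %*% b + rnorm(1000, sd = 2)

b_ols   <- solve(crossprod(X), crossprod(X, y))                # unbiased
b_ridge <- solve(crossprod(X) + 5 * diag(p), crossprod(X, y))  # biased, lambda = 5

mean((y_new - X_new %*% b_ols)^2)    # out-of-sample MSE, OLS
mean((y_new - X_new %*% b_ridge)^2)  # lower for ridge
```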
In the next weeks, we will put particular emphasis on the following topics
General ideas
Supervised learning: Models designed to infer a relationship from labeled training data.
Unsupervised learning: Models designed to infer a relationship from unlabeled training data.
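A minimal R contrast (using the built-in iris data, purely for illustration): the supervised model is handed labels, while the unsupervised one has to find structure on its own.

```r
# Supervised: learn an outcome from labeled (x, y) pairs
supervised <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# Unsupervised: look for structure in x alone, ignoring the labels
unsupervised <- kmeans(iris[, 1:4], centers = 3)
table(unsupervised$cluster, iris$Species)  # clusters vs. held-out labels
```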
Statistical learning models are designed to optimally trade off bias and variance
This typically makes them better suited for prediction than OLS
But also generally biased (so they are generally not meant for inference)
Statistical learning models can also be used for exploratory data analysis
Can we predict gender based on information on an individual's weight/height?
library("readr") library("dplyr") df = read_csv("https://raw.githubusercontent.com/johnmyleswhite/ML_for_Hackers/master/02-Exploration/data/01_heights_weights_genders.csv") glimpse(df)
```
## Observations: 10,000
## Variables: 3
## $ Gender (chr) "Male", "Male", "Male", "Male", "Male", "Male", "Male",...
## $ Height (dbl) 73.84702, 68.78190, 74.11011, 71.73098, 69.88180, 67.25...
## $ Weight (dbl) 241.8936, 162.3105, 212.7409, 220.0425, 206.3498, 152.2...
```
library("ggplot2") ggplot(df, aes(x = Height, fill = Gender)) + geom_density()
```r
# Height against weight, pooled, with a smoothed fit
ggplot(df, aes(x = Weight, y = Height)) +
  geom_point(alpha = .3) +
  geom_smooth()
```
```r
# The same scatter, coloured by gender: two distinct clouds
ggplot(df, aes(x = Weight, y = Height, colour = Gender)) +
  geom_point(alpha = .3)
```
```r
# Recode the outcome to 0/1 and fit a logistic regression
df = df %>% mutate(gender = ifelse(Gender == "Male", 1, 0))
logit.model = glm(gender ~ Height + Weight, data = df,
                  family = binomial(link = "logit"))
summary(logit.model)
```
```
## 
## Call:
## glm(formula = gender ~ Height + Weight, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5073  -0.2211  -0.0006   0.2126   3.6794  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.69254    1.32846   0.521    0.602    
## Height      -0.49262    0.02896 -17.013   <2e-16 ***
## Weight       0.19834    0.00513  38.663   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13862.9  on 9999  degrees of freedom
## Residual deviance:  4182.6  on 9997  degrees of freedom
## AIC: 4188.6
## 
## Number of Fisher Scoring iterations: 7
```
```r
# The black line is the model's decision boundary, where P(male) = 0.5
# (note: with Height on the x-axis, the intercept uses the Weight coefficient)
ggplot(df, aes(x = Height, y = Weight)) +
  geom_point(aes(colour = Gender), alpha = .3) +
  geom_abline(intercept = - coef(logit.model)[1] / coef(logit.model)[3],
              slope     = - coef(logit.model)[2] / coef(logit.model)[3],
              color = "black")
```
Exercise: How did I calculate the `intercept` and the `slope`?
```r
# Size points by how far each fitted probability is from the 0.5 cutoff
df$prediction = logit.model$fitted.values
p = ggplot(sample_frac(df, .25), aes(x = Height, y = Weight)) +
  geom_point(aes(colour = Gender, size = abs(prediction - .5)), alpha = .3) +
  geom_abline(intercept = - coef(logit.model)[1] / coef(logit.model)[3],
              slope     = - coef(logit.model)[2] / coef(logit.model)[3],
              color = "black")
p
```