GVPT722
Let’s look at the relationship between an outcome of interest, \(Y\), and a predictor of that outcome, \(X_1\):
\[ Y = \beta_0 + \beta_1X_1 + \epsilon \]
Let’s assume that we know the true relationship between \(Y\) and \(X_1\):
\[ Y = 10 + 20X_1 + \epsilon \]
This equation has two unknown variables: \(X_1\) and \(\epsilon\).
You need both to work out the value of \(Y\).
The error term captures all of the random things that inevitably muddy the relationship between our outcomes of interest and our predictors in the real world.
It is a set of random values.
We can learn about the shape of our random error.
For example, let’s assume that this random error:
- is normally distributed,
- has a mean of zero, and
- has a standard deviation of 50.
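We can simulate draws from an error term with this shape in R. A minimal sketch (the seed is arbitrary, chosen only for reproducibility):

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# 100 random errors: normally distributed, mean 0, standard deviation 50
e <- rnorm(100, mean = 0, sd = 50)

mean(e)  # close to, but not exactly, zero
sd(e)    # close to, but not exactly, 50
```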
Random error is always added to \(10 + 20X_1\) to produce \(Y\). Let’s simulate that process:
Assume \(X_1\) is equal to all whole numbers between one and 100:
Let’s find the 100 corresponding values of \(Y\):
In other words, what is the line that minimizes the distance between itself and all of these points? More precisely, ordinary least squares finds the line that minimizes the sum of the squared vertical distances.
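A sketch of that whole process in R: simulate data from the true model, then let `lm()` find the best-fitting line. The seed is arbitrary, so the estimates here will differ slightly from the ones reported next:

```r
set.seed(42)  # arbitrary seed, for reproducibility only

x1 <- 1:100                                         # whole numbers from one to 100
y  <- 10 + 20 * x1 + rnorm(100, mean = 0, sd = 50)  # true model plus random error

# Ordinary least squares: the line minimizing the sum of squared distances
m <- lm(y ~ x1)
coef(m)  # estimates close to, but not exactly, the true values of 10 and 20
```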
Our fitted model is:
\[ \hat{Y} = 14.331 + 20.130X_1 \]
Instead of this:
\[ Y = 10 + 20X_1 + \epsilon \]
Why? Because the random error in our sample pulled our estimates slightly away from the true values of 10 and 20.
We have information about how uncertain we are of these coefficients:
Our best guess of the intercept:
[1] 14.33133
Our level of uncertainty in that best guess:
Our best guess of the slope:
Our uncertainty around that best guess:
Traditionally, we treat as plausible all of the alternative coefficient values that fall within the 95 percent confidence interval around our estimate.
The p-value tells us how likely we would be to observe a coefficient estimate as extreme as the one we did if the true coefficient were actually equal to zero.
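Both quantities are easy to pull from a fitted model in R. A sketch using simulated data in the spirit of the earlier example (re-created here with an arbitrary seed, so the exact numbers differ from those reported above):

```r
set.seed(42)  # arbitrary seed, for reproducibility only
x1 <- 1:100
y  <- 10 + 20 * x1 + rnorm(100, mean = 0, sd = 50)
m  <- lm(y ~ x1)

confint(m, level = 0.95)  # the range of plausible values for each coefficient
summary(m)$coefficients   # estimates, standard errors, t statistics, p-values
```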
Regression models cannot prove causality.
This can make them difficult or awkward to interpret.
The National Election Survey asked respondents both their feelings towards President Obama (rating between zero and 100, with higher values indicating more support) and whether or not they own a dog.
Let’s fit a linear regression model against their responses to these two questions:
```r
library(modelsummary)

m <- lm(obama_therm ~ own_dog, data = nes)

modelsummary(m,
             statistic = NULL,
             stars = TRUE,
             coef_rename = c("own_dogYes" = "Owns a dog"))
```
|             | (1)       |
|-------------|-----------|
| (Intercept) | 74.305*** |
| Owns a dog  | −9.286*** |
| Num.Obs.    | 1927      |
| R2          | 0.023     |
| R2 Adj.     | 0.022     |
| AIC         | 18606.2   |
| BIC         | 18622.8   |
| Log.Lik.    | −9300.077 |
| RMSE        | 30.18     |

\+ p < 0.1, \* p < 0.05, \*\* p < 0.01, \*\*\* p < 0.001
Our regression model is as follows:
\[ \widehat{Obama\ thermometer} = 74.305 - 9.286 \times Owns\ a\ dog \]
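Plugging the two possible values of the dog-ownership indicator (zero and one) into the fitted equation gives the model's two predictions:

```r
b0 <- 74.305   # intercept: predicted thermometer rating for non-owners
b1 <- -9.286   # coefficient on dog ownership

b0 + b1 * 0    # non-owners: 74.305
b0 + b1 * 1    # dog owners: 65.019
```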
What does this mean substantively?
It is tempting to state that owning a dog *decreases* a person's feelings towards Obama by 9.286 points.
But this suggests an effect for which we have no proof!
We would be suggesting that if we gave someone a dog, their support for Obama would drop by 9.286 points.
That’s not actually what we have found!
We have observed that, on average, surveyed respondents who owned a dog reported lower feelings towards Obama than those who did not own a dog.
Regression models using observational data only allow us to make comparisons between our units of observation.
Here, we can make comparisons between respondents to the NES. We cannot, however, use this model to make statements about changes to any individual respondent.
\[ \widehat{Obama\ thermometer} = 74.305 - 9.286 \times Owns\ a\ dog \]
Imagine that I pulled someone randomly from the US voting population and asked them their feelings towards President Obama on a 100-point scale. What would be your best guess of their response?
The NES pulled 5,916 people randomly from the US voting population and asked them this very question.
What were their responses?
Imagine that the only information I provide to you is these 5,916 individuals’ responses.
What other piece of information would you like to know about this random individual that might improve your guess?
Are they a Democrat?
How accurately do we predict individuals’ feelings towards Obama?
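A sketch of how these two sets of predictions might be produced; the `nes` data frame and its `dem` and `obama_therm` columns are assumed from the surrounding output, and this is one plausible reconstruction rather than necessarily the original code:

```r
library(dplyr)

pred <- nes |>
  # With no other information, our best guess is the overall mean rating
  mutate(pred_simple = mean(obama_therm, na.rm = TRUE)) |>
  # Knowing party identification, our best guess is each group's mean rating
  group_by(dem) |>
  mutate(pred_party_id = mean(obama_therm, na.rm = TRUE)) |>
  ungroup()
```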
```
   caseid dem obama_therm pred_simple pred_party_id
1     408   0          15    60.74377      44.24474
2    3282   1         100    60.74377      85.30587
3    1942   0          70    60.74377      44.24474
4     118   1          30    60.74377      85.30587
5    5533   0          70    60.74377      44.24474
6    5880   0          45    60.74377      44.24474
7    1651   0          50    60.74377      44.24474
8    6687   0          60    60.74377      44.24474
9    5903   0          15    60.74377      44.24474
10    629   1         100    60.74377      85.30587
11   1434   1          NA    60.74377      85.30587
12   6380   0           0    60.74377      44.24474
```
What’s the sum of those squared distances?
```r
library(dplyr)

pred |>
  mutate(resid_simple = pred_simple - obama_therm,
         resid_party_id = pred_party_id - obama_therm) |>
  summarise(r_2_simple = scales::comma(sum(resid_simple^2, na.rm = TRUE)),
            r_2_party_id = scales::comma(sum(resid_party_id^2, na.rm = TRUE)))
```
```
  r_2_simple r_2_party_id
1  6,582,137    4,351,352
```