OLS Basics

GVPT722

Today’s class

  1. Fitting a regression model
  2. Interpreting statistical significance
  3. Interpreting substantive meaning
  4. Making predictions using your model

Packages for today

library(tidyverse)
library(broom)
library(ggdist)
library(poliscidata)
library(modelsummary)

Regression models

Let’s look at the relationship between an outcome of interest, \(Y\), and a predictor of that outcome, \(X_1\):

\[ Y = \beta_0 + \beta_1X_1 + \epsilon \]

Let’s assume that we know the true relationship between \(Y\) and \(X_1\):

\[ Y = 10 + 20X_1 + \epsilon \]

Solving for \(Y\)

\[ Y = 10 + 20X_1 + \epsilon \]

  • This equation has two unknown variables: \(X_1\) and \(\epsilon\).

  • You need both to work out the value of \(Y\).

Random error

  • The error term captures all of the random things that inevitably muddy the relationship between our outcomes of interest and our predictors in the real world.

  • It is a set of random values.
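
For example, here are five such random values, drawn from the error distribution we will assume in a moment:

# Five draws of random error: normally distributed, mean zero, sd 50
rnorm(5, mean = 0, sd = 50)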

A world with no random error

A world with random error

Looking closely at that random error

We can learn about the shape of our random error.

For example, let’s assume that this random error:

  • Is normally distributed,

  • Has a mean of zero,

  • Has a standard deviation of 50.

Looking closely at that random error

ggplot(tibble(error = rnorm(1e6, mean = 0, sd = 50)), aes(x = error)) + 
  geom_density(fill = "#A2E3C4") + 
  theme_minimal()

Looking closely at that random error

Random error is always added to \(10 + 20X_1\) to produce \(Y\). Let’s simulate that process:

error <- rnorm(1, mean = 0, sd = 50)
error
[1] 24.46725
x_1 <- 1
x_1
[1] 1
y <- 10 + 20*x_1 + error
y
[1] 54.46725

Looking closely at that random error

Random error is always added to \(10 + 20X_1\) to produce \(Y\). Let’s simulate that process:

error <- rnorm(1, mean = 0, sd = 50)
error
[1] -35.64102
x_1 <- 2
x_1
[1] 2
y <- 10 + 20*x_1 + error
y
[1] 14.35898

Looking closely at that random error

Random error is always added to \(10 + 20X_1\) to produce \(Y\). Let’s simulate that process:

error <- rnorm(1, mean = 0, sd = 50)
error
[1] 32.34872
x_1 <- 3
x_1
[1] 3
y <- 10 + 20*x_1 + error
y
[1] 102.3487

Fitting our regression model

Assume \(X_1\) takes every whole number from 1 to 100:

x_1 <- 1:100

Let’s find the 100 corresponding values of \(Y\):

error <- rnorm(100, mean = 0, sd = 50)

df <- tibble(x_1 = x_1,
             y = 10 + 20*x_1 + error)

head(df, n = 5)
# A tibble: 5 × 2
    x_1     y
  <int> <dbl>
1     1  61.6
2     2  53.3
3     3 -14.4
4     4 178. 
5     5  86.5

Fitting our regression model

ggplot(df, aes(x = x_1, y = y)) + 
  geom_point() + 
  theme_minimal()

What is the line of best fit?

In other words, what is the line that minimizes the sum of the squared vertical distances between itself and these points?

m <- lm(y ~ x_1, data = df)

modelsummary(m, statistic = NULL, coef_rename = c(x_1 = "X1"))
              (1)
(Intercept)   14.331
X1            20.130
Num.Obs.      100
R2            0.992
R2 Adj.       0.992
AIC           1083.7
BIC           1091.5
Log.Lik.      −538.852
F             11796.835
RMSE          52.96
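
This fitted line really is "best" in that sense: in this sample, no other line beats it on the sum of squared residuals, not even the true line. A minimal sketch, using df and m from above:

# Sum of squared residuals for the fitted OLS line
rss_ols <- sum(residuals(m)^2)

# Sum of squared residuals for the true line, Y = 10 + 20 * X1
rss_true <- sum((df$y - (10 + 20 * df$x_1))^2)

rss_ols < rss_true  # TRUE: OLS minimizes the in-sample sum of squares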

What is the line of best fit?

Our fitted model is:

\[ Y = 14.331 + 20.130X_1 + \epsilon \]

Instead of this:

\[ Y = 10 + 20X_1 + \epsilon \]

Why?
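
Because our 100 values of \(Y\) carry one particular draw of the random error. Re-draw that error and refit, and the estimates land somewhere slightly different each time. A quick sketch:

# Re-draw the random error and refit: the estimates hover around 10 and 20
df_new <- tibble(x_1 = 1:100,
                 y = 10 + 20 * x_1 + rnorm(100, mean = 0, sd = 50))

coef(lm(y ~ x_1, data = df_new))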

Uncertainty around coefficients

We have information about how uncertain we are of these coefficients:

tidy(m)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     14.3    10.8        1.33 1.87e-  1
2 x_1             20.1     0.185    109.   6.09e-104

Uncertainty around the intercept

Our best guess of the intercept:

intercept_est <- tidy(m) |> 
  filter(term == "(Intercept)") |> 
  pull(estimate)

intercept_est
[1] 14.33133

Our level of uncertainty in that best guess:

intercept_se <- tidy(m) |> 
  filter(term == "(Intercept)") |> 
  pull(std.error)

intercept_se
[1] 10.78078
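
A rough 95 percent confidence interval is the estimate plus or minus about two standard errors (a normal-approximation sketch; confint() gives the exact t-based interval):

# Approximate 95 percent confidence interval for the intercept
intercept_est + c(-1.96, 1.96) * intercept_se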

Uncertainty around \(\beta_1\)

Our best guess:

beta_1_est <- tidy(m) |> 
  filter(term == "x_1") |> 
  pull(estimate)

beta_1_est
[1] 20.13028

Our uncertainty around this best guess:

beta_1_se <- tidy(m) |> 
  filter(term == "x_1") |> 
  pull(std.error)

beta_1_se
[1] 0.1853391
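
The same logic applies to \(\beta_1\); confint() returns the exact intervals for both coefficients at once:

# Exact 95 percent confidence intervals for both coefficients
confint(m, level = 0.95)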

Statistical significance

Traditionally, we build a 95 percent confidence interval around each estimate: the range of alternative coefficient values that remain plausible given our uncertainty. If zero falls outside that range, we call the coefficient statistically significant at the 5 percent level.

  • What estimates are included in this range?
modelplot(m) + 
  geom_vline(xintercept = 0, colour = "red")

P-values

The p-value tells us how likely we would be to observe a coefficient estimate at least as extreme as ours if the true coefficient were actually equal to zero.

tidy(m)
# A tibble: 2 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)     14.3    10.8        1.33 1.87e-  1
2 x_1             20.1     0.185    109.   6.09e-104
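
These columns fit together: the statistic is the estimate divided by its standard error, and the p-value comes from a t distribution with the model’s residual degrees of freedom. We can reproduce the x_1 row by hand:

# Test statistic for x_1: the estimate in standard-error units
t_stat <- beta_1_est / beta_1_se

# Two-sided p-value from the t distribution
2 * pt(-abs(t_stat), df = df.residual(m))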

Interpreting regression models: substantive meaning

  • Regression models cannot prove causality.

  • This can make them difficult or awkward to interpret.

Obama and dog owners

The National Election Study (NES) asked respondents both how they feel towards President Obama (a rating between zero and 100, with higher values indicating more support) and whether or not they own a dog.

Obama and dog owners

Let’s fit a linear regression model to their responses to these two questions:

m <- lm(obama_therm ~ own_dog, data = nes)

modelsummary(m,
             statistic = NULL,
             stars = TRUE,
             coef_rename = c("own_dogYes" = "Owns a dog"))
              (1)
(Intercept)   74.305***
Owns a dog    −9.286***
Num.Obs.      1927
R2            0.023
R2 Adj.       0.022
AIC           18606.2
BIC           18622.8
Log.Lik.      −9300.077
RMSE          30.18
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
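
As a check on those significance stars, broom can attach confidence intervals to the same coefficients:

# 95 percent confidence intervals alongside the estimates
tidy(m, conf.int = TRUE) |> 
  select(term, estimate, conf.low, conf.high)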

Obama and dog owners

Our regression model is as follows:

\[ Obama\ thermometer = 74.305 - 9.286 \times Owns\ a\ dog + \epsilon \]

What does this mean substantively?

Obama and dog owners

It is tempting to state that:

  • The estimated effect of dog ownership on an individual’s feelings towards Obama is a decrease of 9.29 points, on average and holding all else constant.

But this suggests a causal effect for which we have no evidence!

  • We would be suggesting that if we gave someone a dog, their support for Obama would drop by 9.29 points.

  • That’s not actually what we have found!

Obama and dog owners

We have observed that, on average, survey respondents who owned a dog reported lower opinions of Obama than those who did not own a dog.

  • Regression models using observational data only allow us to make comparisons between our units of observation.

  • Here, we can make comparisons between respondents to the NES. We cannot, however, use this model to make statements about changes to any individual respondent.

Where do coefficient estimates come from?

\[ Obama\ thermometer = 74.305 - 9.286 \times Owns\ a\ dog + \epsilon \]

avg_responses <- nes |> 
  drop_na(own_dog) |> 
  group_by(own_dog) |> 
  summarise(avg_obama_therm = mean(obama_therm, 
                                   na.rm = TRUE)) |> 
  mutate(diff = avg_obama_therm - lag(avg_obama_therm))

avg_responses
# A tibble: 2 × 3
  own_dog avg_obama_therm  diff
  <fct>             <dbl> <dbl>
1 No                 74.3 NA   
2 Yes                65.0 -9.29

Making predictions using our model

Imagine that I pulled someone randomly from the US voting population and asked them their feelings towards President Obama on a 100-point scale. What would be your best guess of their response?

Making predictions using our model

The NES pulled 5,916 people randomly from the US voting population and asked them this very question.

What were their responses?

nes |> 
  select(caseid, obama_therm) |> 
  slice_head(n = 10)
   caseid obama_therm
1     408          15
2    3282         100
3    1942          70
4     118          30
5    5533          70
6    5880          45
7    1651          50
8    6687          60
9    5903          15
10    629         100

Making predictions using our model

Imagine that the only information I provide to you is these 5,916 individuals’ responses.

  • Your educated best guess may then be the average of their responses:
mean(nes$obama_therm, na.rm = TRUE)
[1] 60.74377

  • It’s just fancy averaging!
lm(obama_therm ~ 1, data = nes)

Call:
lm(formula = obama_therm ~ 1, data = nes)

Coefficients:
(Intercept)  
      60.74  

Making predictions using our model

What other piece of information would you like to know about this random individual to improve your guess?

Are they a Democrat?

nes |> 
  select(caseid, obama_therm, dem) |> 
  slice_head(n = 10)
   caseid obama_therm dem
1     408          15   0
2    3282         100   1
3    1942          70   0
4     118          30   1
5    5533          70   0
6    5880          45   0
7    1651          50   0
8    6687          60   0
9    5903          15   0
10    629         100   1

Making predictions using our model

  • Now you can look at Democrats and non-Democrats separately:
obama_therm_dem <- nes |> 
  drop_na(obama_therm, dem) |> 
  group_by(dem) |> 
  summarise(avg_obama_therm = mean(obama_therm, na.rm = TRUE))

obama_therm_dem
# A tibble: 2 × 2
    dem avg_obama_therm
  <dbl>           <dbl>
1     0            44.2
2     1            85.3

lm(obama_therm ~ dem, data = nes) |> 
  tidy() |> 
  transmute(term, estimate, cum_est = estimate + lag(estimate))
# A tibble: 2 × 3
  term        estimate cum_est
  <chr>          <dbl>   <dbl>
1 (Intercept)     44.2    NA  
2 dem             41.1    85.3

Have we improved our guess?

How accurately do we predict individuals’ feelings towards Obama?
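
The pred table printed below is not built on-slide; one way to construct it, assuming the objects defined above (a sketch, with column names taken from the output):

# A possible construction of the predictions table (assumed, not shown above)
pred <- nes |> 
  select(caseid, dem, obama_therm) |> 
  mutate(pred_simple = mean(obama_therm, na.rm = TRUE)) |> 
  left_join(obama_therm_dem, by = "dem") |> 
  rename(pred_party_id = avg_obama_therm)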

   caseid dem obama_therm pred_simple pred_party_id
1     408   0          15    60.74377      44.24474
2    3282   1         100    60.74377      85.30587
3    1942   0          70    60.74377      44.24474
4     118   1          30    60.74377      85.30587
5    5533   0          70    60.74377      44.24474
6    5880   0          45    60.74377      44.24474
7    1651   0          50    60.74377      44.24474
8    6687   0          60    60.74377      44.24474
9    5903   0          15    60.74377      44.24474
10    629   1         100    60.74377      85.30587
11   1434   1          NA    60.74377      85.30587
12   6380   0           0    60.74377      44.24474

How does each approach perform?

What’s the sum of those squared distances?

pred |> 
  mutate(resid_simple = pred_simple - obama_therm,
         resid_party_id = pred_party_id - obama_therm) |> 
  summarise(rss_simple = scales::comma(sum(resid_simple^2, na.rm = TRUE)),
            rss_party_id = scales::comma(sum(resid_party_id^2, na.rm = TRUE)))
  rss_simple rss_party_id
1  6,582,137    4,351,352
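
Those two sums also connect back to \(R^2\): because the simple guess is just the overall mean, the party-ID model’s \(R^2\) is the proportional reduction in squared error (up to missing-data handling):

# R-squared as the proportional reduction in squared prediction error
1 - 4351352 / 6582137  # roughly 0.34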