install.packages("fitzRoy")
Regression
Set up
library(tidyverse)
library(fitzRoy)
library(broom)
library(modelsummary)
library(ggdist)
set.seed(1234)
Today, we are going to explore the linear relationship between two variables. At the end of this session you will be able to describe how those two variables relate to one another, and whether that relationship is statistically significant.
As usual, we will build our understanding using an example. As an Australian (more specifically, a Victorian), I love AFL. It’s the best sport out there! We’re going to depart from our usual political science examples to explore some interesting relationships in the way AFL is played.
First, let’s watch this brief introduction to the game:
Let’s explore the relationship between disposal efficiency and Dream Team points. Your disposal efficiency describes the percentage of your disposals (kicks, handballs, etc.) that lead to a positive outcome for your team. The Dream Team is a popular fantasy football competition. Each player gets points for various actions they take on the field. For example, you get six Dream Team points for scoring a goal and you lose three points for having a free kick awarded against you. This is a useful proxy measure of a player’s effectiveness on the field.
Loading in our data
We are going to explore some different player-level outcomes from the tenth round of the most recent AFLW season. This is the last round played before finals. First, we need to collect our data. The fitzRoy::fetch_player_stats()
retrieves some useful variables about each player from the official AFL website.
<- fetch_player_stats(2023, round = 10, comp = "AFLW")
afl_df afl_df
# A tibble: 378 × 70
providerId utcStartTime status compSeason.shortName round.name
<chr> <chr> <chr> <chr> <chr>
1 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
2 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
3 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
4 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
5 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
6 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
7 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
8 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
9 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
10 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10
# ℹ 368 more rows
# ℹ 65 more variables: round.roundNumber <int>, venue.name <chr>,
# home.team.name <chr>, home.team.club.name <chr>, away.team.name <chr>,
# away.team.club.name <chr>, player.jumperNumber <int>,
# player.photoURL <chr>, player.player.position <chr>,
# player.player.player.playerId <chr>, player.player.player.captain <lgl>,
# player.player.player.playerJumperNumber <int>, …
In this round, 378 players played in 9 games. We have access to 70 variables describing how these players performed in this round.
Uncovering the relationship between two variables
First, we want to determine the direction of the relationship between our variables of interest. When a player’s disposal efficiency increases, do they tend to get a greater or fewer number of Dream Team points?
We can often identify the direction of the relationship by plotting our data:
ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) +
geom_point() +
theme_minimal()
The relationship looks positive: as your efficiency increases, so too do your Dream Team points. This makes sense. A good player is more efficient (tends to produce good results from their kicks and handballs). Our Dream Team points are a proxy for an effective player. However, there is a fair bit of noise here. This relationship doesn’t look very strong.
Next, we want to formalize that relationship. We want to find the line that best fits all of these points. In other words, what line minimizes the distance between itself and all of the observed points marking each player’s Dream Team points and disposal efficiency?
Remember, this is the basis for Ordinary Least Squares regression.
We can use ggplot2::geom_smooth()
to fit this line graphically:
ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal()
We correctly identified that there is a positive relationship between those two variables. Yay!
Often we need more information than simply the direction of a relationship. For example, we want to know how pronounced the effect of a one-unit increase in our independent variable will be on our outcome of interest. Here, we want to learn how many additional Dream Team points are associated with a a one percentage point increase in a player’s disposal efficiency, on average.
We can use lm()
to determine this information:
Remember to put your dependent variable first, then a ~
, then you independent variable(s).
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_df)
m summary(m)
Call:
lm(formula = dreamTeamPoints ~ disposalEfficiency, data = afl_df)
Residuals:
Min 1Q Median 3Q Max
-53.618 -17.270 -4.335 14.876 99.525
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.32863 4.57123 9.260 <2e-16 ***
disposalEfficiency 0.13289 0.07052 1.884 0.0603 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 24.99 on 376 degrees of freedom
Multiple R-squared: 0.009356, Adjusted R-squared: 0.006722
F-statistic: 3.551 on 1 and 376 DF, p-value: 0.06028
Great! We now have a linear regression model of the relationship between a player’s disposal efficiency and their Dream Team points.
We predict that players that have a disposal efficiency of 0 (none of their disposals result in a good outcome for their team) have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.
Great! We can use this information to understand this relationship more meaningfully. For example, we note that a player with a disposal efficiency of 1.33 percentage points greater than another player has, on average, 10 more Dream Team points than the other player.
Uncertainty around this relationship
How confident can we be in this modeled relationship? Am I really sure that efficiency has a significant, positive impact on a player’s Dream Team points?
To answer these questions, we need to draw on everything we have learned about hypothesis testing and inference so far.
We want to make a claim about the relationship between some outcome and some variables that we think are important determinants of that outcome. Here, we used the observed Dream Team points gained by a player and their disposal efficiency for round 10 in the 2023 season of AFLW to make a broader statement about the relationship between efficiency and Dream Team points. We are not actually interested in the specific relationship between these variables in round 10 of the 2023 season: we want to know whether a player’s efficiency is an important determinant of their Dream Team points generally.
In other words, we are inferring from a sample a general relationship. You can imagine that if we had a different sample, a linear regression model would have a different set of estimates for our intercept and coefficient. They would (hopefully) look very similar to the ones we found above, but they would be slightly different. For example, imagine that some of the random elements of a game of AFL were different: the wind blew in a different way, the crowd cheered a little louder, a player ran a little faster. These would change the game played in slight and random ways. A player might make a clanger instead of a clean disposal. The wind may blow an otherwise goal into a behind. Consequently, the players’ disposal efficiency would be slightly different. We would subsequently get different model estimates.
Let’s illustrate this. To be able to access this variation to illustrate our point, I am going to take pure random samples from our 378 players.
We are solidly in the multiverse world with this AFL example. We did not take a random sample of players who played in round 10, collect information on their Dream Team points and efficiency, and fit our regression. Rather, we have information about every player.
This is also often the case in political science. For example, we have full information on the number of wars fought between countries in the modern era. We also have a lot of data on those countries. We do not; therefore, need to take a sample from that population (unlike pollsters attempting to work out a politician’s level of support from all US voters).
However, this does not mean that we know with certainty the relationship between our variables of interest. Just as with sport, there is so much randomness that can effect these outcomes. We want to capture and account for that randomness. So, we have to treat these data as samples.
Let’s take a completely random sample of 250 players from our data and fit our regression against that sample:
<- afl_df |>
afl_sample_1 sample_n(250) |>
select(player.player.player.givenName,
player.player.player.surname,
dreamTeamPoints,
disposalEfficiency) afl_sample_1
# A tibble: 250 × 4
player.player.player.givenName player.player.player.surname dreamTeamPoints
<chr> <chr> <dbl>
1 Simone Nalder 21
2 Emelia Yassir 20
3 Ashleigh Saint 76
4 Chloe Dalton 44
5 Renee Garing 92
6 Ebony O'Dea 47
7 Jasmin Stewart 43
8 Sophie Conway 68
9 Hannah Ewings 22
10 Courtney Jones 43
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_1)
m
tidy(m)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 36.8 5.20 7.07 1.54e-11
2 disposalEfficiency 0.215 0.0810 2.65 8.47e- 3
Okay, so we got a different intercept and coefficient estimates than we got above. This is despite the fact that I took a pure random sample from the full set of players. The only thing driving this difference is random chance.
Let’s do this again. I will take a different random sample of players and fit a regression for them:
<- afl_df |>
afl_sample_2 sample_n(250) |>
select(player.player.player.givenName,
player.player.player.surname,
dreamTeamPoints,
disposalEfficiency) afl_sample_2
# A tibble: 250 × 4
player.player.player.givenName player.player.player.surname dreamTeamPoints
<chr> <chr> <dbl>
1 Kodi Jacques 50
2 Dana East 64
3 Cambridge McCormick 48
4 Rachelle Martin 74
5 Madison Newman 94
6 Courtney Jones 43
7 Mackenzie Webb 30
8 Jessica Low 36
9 Erin Phillips 76
10 Justine Mules 35
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_2)
m
tidy(m)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 41.3 5.32 7.75 2.31e-13
2 disposalEfficiency 0.165 0.0827 2.00 4.71e- 2
Again, we got a different set of coefficients. This is only the result of random chance.
Let’s go again:
<- afl_df |>
afl_sample_3 sample_n(250) |>
select(player.player.player.givenName,
player.player.player.surname,
dreamTeamPoints,
disposalEfficiency) afl_sample_3
# A tibble: 250 × 4
player.player.player.givenName player.player.player.surname dreamTeamPoints
<chr> <chr> <dbl>
1 Taylor Smith 47
2 Alice O'Loughlin 46
3 Zarlie Goldsworthy 103
4 Chelsea Biddell 58
5 Courtney Hodder 92
6 Mim Strom 75
7 Jade Ellenger 92
8 Zoe Wakfer 32
9 Isabel Huntington 37
10 Brooke Brown 58
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_3)
m
tidy(m)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 42.3 5.40 7.84 1.35e-13
2 disposalEfficiency 0.126 0.0838 1.50 1.34e- 1
We can see these differences:
|>
afl_sample_1 mutate(sample = "One") |>
bind_rows(
mutate(afl_sample_2, sample = "Two")
|>
) bind_rows(
mutate(afl_sample_3, sample = "Three")
|>
) ggplot(aes(x = disposalEfficiency, y = dreamTeamPoints, colour = sample)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = F) +
theme_minimal()
Let’s do this many times:
<- function(i) {
afl_sample_regression
<- sample_n(afl_df, 250)
afl_sample
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample)
m
<- tibble(round = i, model = tidy(m))
model_results
return(model_results)
}
<- map(1:1000, afl_sample_regression, .progress = T) |>
afl_regressions bind_rows()
Now we can take a look at those different regression coefficients:
|>
afl_regressions unnest(model)
# A tibble: 2,000 × 6
round term estimate std.error statistic p.value
<int> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 39.1 5.25 7.45 1.51e-12
2 1 disposalEfficiency 0.172 0.0811 2.12 3.54e- 2
3 2 (Intercept) 45.6 5.74 7.94 6.85e-14
4 2 disposalEfficiency 0.0719 0.0871 0.826 4.10e- 1
5 3 (Intercept) 43.9 5.90 7.44 1.68e-12
6 3 disposalEfficiency 0.110 0.0912 1.21 2.28e- 1
7 4 (Intercept) 37.8 5.35 7.08 1.51e-11
8 4 disposalEfficiency 0.190 0.0829 2.29 2.29e- 2
9 5 (Intercept) 42.2 5.69 7.43 1.80e-12
10 5 disposalEfficiency 0.140 0.0884 1.58 1.16e- 1
# ℹ 1,990 more rows
This shows the intercept and coefficient for the relationship between disposal efficiency and Dream Team points for each of the models we generated from our 1,000 different random samples. Remember, the only thing driving the differences between each model’s intercept and coefficient is random chance.
Let’s start with the disposal efficiency coefficients:
|>
afl_regressions unnest(model) |>
filter(term == "disposalEfficiency") |>
ggplot(aes(x = estimate)) +
stat_halfeye() +
theme_minimal() +
labs(caption = "Median is shown with point. One and two standard deviations are shown by the black bars.")
Here is the distribution of those 1,000 different coefficients for disposal efficiency for each model.
Next, we can look at the intercepts:
|>
afl_regressions unnest(model) |>
filter(term == "(Intercept)") |>
ggplot(aes(x = estimate)) +
stat_halfeye() +
theme_minimal() +
labs(caption = "Median is shown with point. One and two standard deviations are shown by the black bars.")
You can see that one coefficient and estimate is common (represented by the peak of these two density plots) and that other estimates fall roughly symmetrically around this most common estimate.
In fact, if we ran an infinite number of regressions from an infinite number of random samples from our data, these coefficients would be distributed following the t-distribution.
We can use this knowledge to focus back on that single regression model that we built right at the start of this session.
<- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_df)
m tidy(m)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 42.3 4.57 9.26 1.59e-18
2 disposalEfficiency 0.133 0.0705 1.88 6.03e- 2
Building uncertainty into our single regression model
Generally, we have a single sample and we use that sample to fit our models. We still need to account for the uncertainty (demonstrated above) created by random chance. We know that from an infinite number of samples, we would fit linear models with coefficients that follow the t-distribution. So, let’s use that knowledge to work out how certain we are of our estimates found above.
We have our point estimates for our intercept and coefficient:
tidy(m) |>
select(term, estimate)
# A tibble: 2 × 2
term estimate
<chr> <dbl>
1 (Intercept) 42.3
2 disposalEfficiency 0.133
These are our best guess at the true linear relationship between disposal efficiency and Dream Team points. Our best guess is that players who have a disposal efficiency of zero have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.
Perfect. Let’s set these best guesses as our center points.
Now, we need to build out the plausible set of intercepts and coefficients that would result from an infinite number of random samples drawn from our population. To do this, we need to work out how spread out these intercepts and coefficients would be from their center points. Formally, this spread is called the standard deviation.
The standard deviation, \(s\), is calculated using two pieces of information. First, it looks at how well our line of best fit (our regression model) fits our observed data. How far are the predicted values (represented by the blue line on the below graph) from the observed values (represented by the black dots)?
ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
theme_minimal()
We will have more uncertainty around our intercept and coefficient if the distance between the predicted values and observed values is large. This makes sense: if the predicted values and the observed values were very similar, we are unlikely to find a wildly different line of best fit from a different random sample from our population.
The standard deviation also takes into account the amount of information you have used to generate this intercept and coefficient. Greater sample sizes using less variables result in less uncertainty around those model coefficients.
In sum, the standard deviation is calculated as:
\[ s = \sqrt{\frac{\sum{e_i^2}}{n-k-1}} \]
Where \(e_i\) is the difference between each predicted value and observed value (the vertical distance between each point and the blue line on the graph above).
\(n-k-1\) is the degrees of freedom in the model. It accounts for the amount of information you used to build the model, with \(n\) equal to the number of observations you used and \(k\) equal to the number of independent variables you included. Here, our \(k=1\) because we are only using one independent variable: disposal efficiency.
We can use this spread to work out our standard errors around our coefficients. Broadly, the standard error places this uncertainty within the context of the model.
For the intercept, it is:
\[ SE_{\beta_0} = s(\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x_i-\bar{x})^2}}}) \]
For the coefficients, it is:
\[ SE_{\beta_1} = \frac{s}{\sqrt{\sum{(x_i-\bar{x})^2}}} \]
You will calculate all of this in the background. Do not worry about memorizing these formulas. I put them here to help build your intuition about what goes into your uncertainty around the relationship you find between your outcome of interest and the variables you think drive that outcome.
Happily, broom::tidy()
calculates all of this for us:
tidy(m) |>
select(term:std.error)
# A tibble: 2 × 3
term estimate std.error
<chr> <dbl> <dbl>
1 (Intercept) 42.3 4.57
2 disposalEfficiency 0.133 0.0705
Okay, so we have a good sense of the spread around our best guess. Let’s visualize this:
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_halfeye() +
theme_minimal()
There are two things to note here. The first is that I am using the normal distribution, not the t-distribution. As far as I can tell, there isn’t a way to change the center or spread of random draws from the t-distribution in R (done using rt()
). So, to illustrate I am using random draws from the normal distribution (done using rnorm()
). At these large degrees of freedom, these distributions are almost identical, so we can roll with this.
Second, I am pulling 1 million random draws from this distribution. In theory, we talk about the distribution of an infinite number of coefficients built from an infinite number of samples. I, obviously, cannot do this, so I use 1 million as my stand-in for infinity.
Above, we show the plausible set of coefficients describing the effect of a one percentage point increase in a player’s disposal efficiency on their Dream Team points, on average. Remember, our best guess (taken from our regression model) is that a one percentage point increase in efficiency is associated with a 0.13 increase in a player’s Dream Team points, on average. However, as we saw above when we built 1,000 different models from 1,000 different random samples, it is entirely plausible that we could get different coefficients from a model built from a different sample. These differences are the product of random chance. That’s fine! We just need to acknowledge that.
Let’s also visualize all of the plausible intercepts we could get:
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "(Intercept)") |> pull(estimate),
sd = tidy(m) |> filter(term == "(Intercept)") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_halfeye() +
theme_minimal()
Again, our best guess is that players who have a disposal efficiency of 0 have, on average, 42.3 Dream Team points. This is one of many plausible intercepts that we could generate.
Connecting this to our research question
We used our data to uncover an interesting relationship between disposal efficiency and Dream Team points, on average. As your disposal efficiency increases, so too do your Dream Team points. We found that if players who increased their disposal efficiency by one percentage point saw an increase of 0.13 Dream Team points, on average.
Now, imagine that you are a player and you want to increase your Dream Team points. Do you trust this? Do you really believe that by increasing your disposal efficiency you will increase your Dream Team points? What if, in fact, there is no relationship between these two variables?
We know from above that there are a range of plausible intercepts and coefficients that result from random chance. Does this plausible range include zero? In other words, is it plausible that there is no relationship between the outcome of interest (Dream Team points) and your independent variable (disposal efficiency)?
To answer this question, we draw on our knowledge of hypothesis testing.
First, we need to define what we mean by a plausible range of relationships. Traditionally, we are willing to accept a five percent risk that we declare that there is a relationship between an outcome and an independent variable when there is, in fact, no relationship. Let’s stick with that threshold.
Okay, so now we need to work out where the other 95 percent of plausible values sit. Let’s start with the coefficient describing the relationship between Dream Team points and disposal efficiency. Here is our representation of the coefficients drawn from an infinite number random samples pulled from our population (it is the same as printed above):
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_halfeye() +
theme_minimal()
Where do 95 percent of these coefficients fall around our best guess? We can use our knowledge of the t-distribution to answer this question.
We know that center point (our coefficient):
<- tidy(m) |>
beta_1 filter(term == "disposalEfficiency") |>
pull(estimate)
beta_1
[1] 0.1328924
We know how spread out our alternative coefficients are from that center point (our standard error):
<- tidy(m) |>
se_beta_1 filter(term == "disposalEfficiency") |>
pull(std.error)
se_beta_1
[1] 0.0705205
And we know our threshold (5%). We need to translate that threshold into its t-statistic, accounting for the degrees of freedom we have access to:
<- nrow(afl_df)
sample_size sample_size
[1] 378
<- 1
no_of_IVs no_of_IVs
[1] 1
<- sample_size - no_of_IVs - 1
df df
[1] 376
Therefore:
<- qt(0.025, df = df, lower.tail = F)
t_stat_95 t_stat_95
[1] 1.966293
This is the point beyond which 2.5% of all values along the t-distribution fall.
Remember, we want to find out where 95 percent of these alternative coefficients sit around the center point. So, we need to distribute our remaining five percent between the two tails. That’s why we find the t-statistic beyond which 2.5% of the data fall.
Now we can find the boundaries within which 95 percent of these alternative coefficients fall:
Remember:
\[ CI = \beta_1 \pm t*SE_{\beta_1} \]
<- beta_1 - t_stat_95*se_beta_1
lower_ci lower_ci
[1] -0.005771574
<- beta_1 + t_stat_95*se_beta_1
upper_ci upper_ci
[1] 0.2715564
Let’s place those in context:
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) +
theme_minimal() +
theme(legend.position = "none")
Brilliant! Now we know where 95 percent of all alternative coefficients drawn from a random sample of our population could fall. These are our plausible alternative coefficients.
So, do these plausible alternatives include zero? In other words, is it plausible that there is no relationship between a player’s disposal efficiency and their Dream Team points?
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) +
geom_vline(xintercept = 0) +
theme_minimal() +
theme(legend.position = "none")
Yes! A null relationship sits within our plausible set of coefficients for disposal efficiency. We cannot reject the notion that there is no relationship between disposal efficiency and Dream Team points at this threshold.
What about a more forgiving threshold? What if we are willing to accept a 10 percent chance that we reject a true null relationship?
First, we need to find this new threshold’s t-statistic:
<- qt(0.05, df = df, lower.tail = F)
t_stat_90 t_stat_90
[1] 1.648916
And; therefore, our new boundaries (within which 90 percent of alternative coefficients sit):
<- beta_1 - t_stat_90*se_beta_1
lower_ci lower_ci
[1] 0.01661001
<- beta_1 + t_stat_90*se_beta_1
upper_ci upper_ci
[1] 0.2491748
Do these contain a null relationship?
tibble(
x = rnorm(1e6,
mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
|>
) ggplot(aes(x = x)) +
stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) +
geom_vline(xintercept = 0) +
theme_minimal() +
theme(legend.position = "none")
No! So, if we are happy to accept a 10 percent risk that we will reject a true null relationship, we can reject the null hypothesis that there is no relationship between disposal efficiency and Dream Team points. We can tell the players that we have found a statistically significant relationship between disposal efficiency and Dream Team points at the 0.1 threshold.
We can approach this question from the other direction. If the null hypothesis were true, how likely would we be to see our estimate?
First, let’s set up our null world:
tibble(
x = rt(1e6, df = df)
|>
) ggplot(aes(x = x)) +
stat_slab() +
theme_minimal() +
theme(legend.position = "none")
I am back to random draws from our t-distribution.
Here, we have centered our distribution of alternative coefficients (resulting only from differences in our samples brought about by random chance) at zero. We are in the null world: there is no relationship between disposal efficiency and Dream Team points.
Where does the estimate we found in our sample sit within this null world? First, we need to translate that observed estimate into its t-statistic:
\[ t = \frac{\beta_1}{SE_{\beta_1}} \]
<- beta_1 / se_beta_1
t_stat t_stat
[1] 1.884451
Let’s place this in the context of the null world:
tibble(
x = rt(1e6, df = df)
|>
) ggplot(aes(x = x)) +
stat_slab() +
geom_vline(xintercept = t_stat) +
theme_minimal() +
theme(legend.position = "none")
How likely is it that I would get this coefficient or a more extreme coefficient if we did, in fact, live in the null world? In other words, what proportion of these alternative coefficients (highlighted in dark gray on the graph below) are equal to or greater than our observed estimate?
Remember that we are conducting a two-tailed test of our null hypothesis. We need to be open to the estimate being greater or smaller than the null.
tibble(
x = rt(1e6, df = df)
|>
) ggplot(aes(x = x)) +
stat_slab(aes(fill_ramp = after_stat(x < -t_stat | x > t_stat))) +
geom_vline(xintercept = t_stat) +
theme_minimal() +
theme(legend.position = "none")
The proportion of alternative coefficients that are equal to or more extreme than the one we observed is:
<- 2 * pt(t_stat, df = df, lower.tail = F)
p_value p_value
[1] 0.06027518
We would observe an estimate of 0.133 or a more extreme estimate 6% of the time if the null hypothesis were true. When we are only happy to accept a five percent chance that we would reject a true null hypothesis, we cannot reject that null hypothesis. If, on the other hand, we are happy to accept a 10 percent chance that we reject a true null hypothesis, we can reject it.
Reading our regression outputs
Let’s return to our original model:
tidy(m)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 42.3 4.57 9.26 1.59e-18
2 disposalEfficiency 0.133 0.0705 1.88 6.03e- 2
We now have information on every component of this output.
We can translate these estimates, in the estimate
column. Our model suggests that players that have a disposal efficiency of 0 (none of their disposals result in a good outcome for their team) have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.
We know that these estimates are our best guess of the true linear relationship between Dream Team points and disposal efficiency. Our best guess may be different from the true relationship because of random chance. How confident are we of that best guess? Well, we know from the standard error (std.error
) how spread out around that best guess alternative coefficients sit.
We know where our estimate (translated into its t-statistic, or statistic
) sits within the null world.
And finally, we know the probability (p.value
) that we would observe the estimate we found if it were actually equal to zero.
All of that work we did above is replicated here, in this one line of code. In fact, we can also add our confidence intervals around our estimate to the broom::tidy()
output:
tidy(m, conf.int = T)
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 42.3 4.57 9.26 1.59e-18 33.3 51.3
2 disposalEfficiency 0.133 0.0705 1.88 6.03e- 2 -0.00577 0.272
These are exactly as we manually calculated above.
What would this all look like in a published article?
modelsummary(m,
coef_rename = c("disposalEfficiency" = "Disposal efficiency"),
statistic = c("std.error", "p.value"))
(1) | |
---|---|
(Intercept) | 42.329 |
(4.571) | |
(<0.001) | |
Disposal efficiency | 0.133 |
(0.071) | |
(0.060) | |
Num.Obs. | 378 |
R2 | 0.009 |
R2 Adj. | 0.007 |
AIC | 3510.0 |
BIC | 3521.8 |
Log.Lik. | −1751.992 |
F | 3.551 |
RMSE | 24.93 |
Increasingly, people are including confidence intervals:
modelsummary(m,
coef_rename = c("disposalEfficiency" = "Disposal efficiency"),
statistic = c("conf.int", "p.value"))
(1) | |
---|---|
(Intercept) | 42.329 |
[33.340, 51.317] | |
(<0.001) | |
Disposal efficiency | 0.133 |
[−0.006, 0.272] | |
(0.060) | |
Num.Obs. | 378 |
R2 | 0.009 |
R2 Adj. | 0.007 |
AIC | 3510.0 |
BIC | 3521.8 |
Log.Lik. | −1751.992 |
F | 3.551 |
RMSE | 24.93 |
Or presenting their regression results graphically:
modelplot(m,
coef_rename = c("disposalEfficiency" = "Disposal efficiency"))