Regression

Set up

install.packages("fitzRoy")
library(tidyverse)
library(fitzRoy)
library(broom)
library(modelsummary)
library(ggdist)

set.seed(1234)

Today, we are going to explore the linear relationship between two variables. At the end of this session you will be able to describe how those two variables relate to one another, and whether that relationship is statistically significant.

As usual, we will build our understanding using an example. As an Australian (more specifically, a Victorian), I love AFL. It’s the best sport out there! We’re going to depart from our usual political science examples to explore some interesting relationships in the way AFL is played.

First, let’s watch this brief introduction to the game:

Let’s explore the relationship between disposal efficiency and Dream Team points. Your disposal efficiency describes the percentage of your disposals (kicks, handballs, etc.) that lead to a positive outcome for your team. The Dream Team is a popular fantasy football competition. Each player gets points for various actions they take on the field. For example, you get six Dream Team points for scoring a goal and you lose three points for having a free kick awarded against you. This is a useful proxy measure of a player’s effectiveness on the field.

Loading in our data

We are going to explore some different player-level outcomes from the tenth round of the most recent AFLW season. This is the last round played before finals. First, we need to collect our data. The fitzRoy::fetch_player_stats() function retrieves some useful variables about each player from the official AFL website.

afl_df <- fetch_player_stats(2023, round = 10, comp = "AFLW")
afl_df
# A tibble: 378 × 70
   providerId      utcStartTime           status compSeason.shortName round.name
   <chr>           <chr>                  <chr>  <chr>                <chr>     
 1 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 2 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 3 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 4 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 5 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 6 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 7 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 8 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
 9 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
10 CD_M20232641001 2023-11-03T08:45:00.0… CONCL… 2023 NAB AFLW Season Round 10  
# ℹ 368 more rows
# ℹ 65 more variables: round.roundNumber <int>, venue.name <chr>,
#   home.team.name <chr>, home.team.club.name <chr>, away.team.name <chr>,
#   away.team.club.name <chr>, player.jumperNumber <int>,
#   player.photoURL <chr>, player.player.position <chr>,
#   player.player.player.playerId <chr>, player.player.player.captain <lgl>,
#   player.player.player.playerJumperNumber <int>, …

In this round, 378 players played in 9 games. We have access to 70 variables describing how these players performed in this round.
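We can confirm those counts from the data itself. This is a quick sketch that assumes providerId uniquely identifies each match (as the output above suggests):

afl_df |> 
  summarise(players = n(), 
            games = n_distinct(providerId))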

Uncovering the relationship between two variables

First, we want to determine the direction of the relationship between our variables of interest. When a player’s disposal efficiency increases, do they tend to get more or fewer Dream Team points?

We can often identify the direction of the relationship by plotting our data:

ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) + 
  geom_point() + 
  theme_minimal()

The relationship looks positive: as your efficiency increases, so too do your Dream Team points. This makes sense. A good player is more efficient (tends to produce good results from their kicks and handballs). Our Dream Team points are a proxy for an effective player. However, there is a fair bit of noise here. This relationship doesn’t look very strong.

Next, we want to formalize that relationship. We want to find the line that best fits all of these points. In other words, what line minimizes the distance between itself and all of the observed points marking each player’s Dream Team points and disposal efficiency?

Tip

Remember, this is the basis for Ordinary Least Squares regression.

We can use ggplot2::geom_smooth() to fit this line graphically:

ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F) + 
  theme_minimal()

We correctly identified that there is a positive relationship between those two variables. Yay!

Often we need more information than simply the direction of a relationship. For example, we want to know how pronounced the effect of a one-unit increase in our independent variable will be on our outcome of interest. Here, we want to learn how many additional Dream Team points are associated with a one percentage point increase in a player’s disposal efficiency, on average.

We can use lm() to determine this information:

Tip

Remember to put your dependent variable first, then a ~, then your independent variable(s).

m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_df)
summary(m)

Call:
lm(formula = dreamTeamPoints ~ disposalEfficiency, data = afl_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.618 -17.270  -4.335  14.876  99.525 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        42.32863    4.57123   9.260   <2e-16 ***
disposalEfficiency  0.13289    0.07052   1.884   0.0603 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.99 on 376 degrees of freedom
Multiple R-squared:  0.009356,  Adjusted R-squared:  0.006722 
F-statistic: 3.551 on 1 and 376 DF,  p-value: 0.06028

Great! We now have a linear regression model of the relationship between a player’s disposal efficiency and their Dream Team points.
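Written out as an equation, our fitted line is:

\[ \widehat{\text{dreamTeamPoints}} = 42.33 + 0.13 \times \text{disposalEfficiency} \]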

We predict that players that have a disposal efficiency of 0 (none of their disposals result in a good outcome for their team) have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.

Great! We can use this information to understand this relationship more meaningfully. For example, a player whose disposal efficiency is 10 percentage points higher than another player’s has, on average, roughly 1.3 more Dream Team points.
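We can double-check that arithmetic by asking the model for predictions directly. This is a quick sketch: the efficiency values of 60 and 70 are arbitrary, chosen only because they sit 10 percentage points apart.

# predicted Dream Team points at two efficiency levels 10 percentage points apart
predict(m, newdata = tibble(disposalEfficiency = c(60, 70)))

The gap between those two predictions is roughly 1.3 Dream Team points, matching our coefficient.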

Uncertainty around this relationship

How confident can we be in this modeled relationship? Am I really sure that efficiency has a significant, positive impact on a player’s Dream Team points?

To answer these questions, we need to draw on everything we have learned about hypothesis testing and inference so far.

We want to make a claim about the relationship between some outcome and some variables that we think are important determinants of that outcome. Here, we used the observed Dream Team points gained by a player and their disposal efficiency for round 10 in the 2023 season of AFLW to make a broader statement about the relationship between efficiency and Dream Team points. We are not actually interested in the specific relationship between these variables in round 10 of the 2023 season: we want to know whether a player’s efficiency is an important determinant of their Dream Team points generally.

In other words, we are inferring a general relationship from a sample. You can imagine that if we had a different sample, a linear regression model would produce a different set of estimates for our intercept and coefficient. They would (hopefully) look very similar to the ones we found above, but they would be slightly different. For example, imagine that some of the random elements of a game of AFL were different: the wind blew in a different way, the crowd cheered a little louder, a player ran a little faster. These would change how the game played out in slight and random ways. A player might make a clanger instead of a clean disposal. The wind may turn what would otherwise be a goal into a behind. Consequently, the players’ disposal efficiency would be slightly different. We would subsequently get different model estimates.

Let’s illustrate this. To access this variation, I am going to take pure random samples from our 378 players.

Note

We are solidly in the multiverse world with this AFL example. We did not take a random sample of players who played in round 10, collect information on their Dream Team points and efficiency, and fit our regression. Rather, we have information about every player.

This is also often the case in political science. For example, we have full information on the number of wars fought between countries in the modern era. We also have a lot of data on those countries. We do not, therefore, need to take a sample from that population (unlike pollsters attempting to work out a politician’s level of support from all US voters).

However, this does not mean that we know with certainty the relationship between our variables of interest. Just as in sport, there is so much randomness that can affect these outcomes. We want to capture and account for that randomness. So, we have to treat these data as samples.

Let’s take a completely random sample of 250 players from our data and fit our regression against that sample:

afl_sample_1 <- afl_df |> 
  sample_n(250) |>
  select(player.player.player.givenName, 
         player.player.player.surname,
         dreamTeamPoints,
         disposalEfficiency)
afl_sample_1
# A tibble: 250 × 4
   player.player.player.givenName player.player.player.surname dreamTeamPoints
   <chr>                          <chr>                                  <dbl>
 1 Simone                         Nalder                                    21
 2 Emelia                         Yassir                                    20
 3 Ashleigh                       Saint                                     76
 4 Chloe                          Dalton                                    44
 5 Renee                          Garing                                    92
 6 Ebony                          O'Dea                                     47
 7 Jasmin                         Stewart                                   43
 8 Sophie                         Conway                                    68
 9 Hannah                         Ewings                                    22
10 Courtney                       Jones                                     43
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_1)

tidy(m)
# A tibble: 2 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          36.8      5.20        7.07 1.54e-11
2 disposalEfficiency    0.215    0.0810      2.65 8.47e- 3

Okay, so we got different intercept and coefficient estimates than we got above. This is despite the fact that I took a pure random sample from the full set of players. The only thing driving this difference is random chance.

Let’s do this again. I will take a different random sample of players and fit a regression for them:

afl_sample_2 <- afl_df |> 
  sample_n(250) |>
  select(player.player.player.givenName, 
         player.player.player.surname,
         dreamTeamPoints,
         disposalEfficiency)
afl_sample_2
# A tibble: 250 × 4
   player.player.player.givenName player.player.player.surname dreamTeamPoints
   <chr>                          <chr>                                  <dbl>
 1 Kodi                           Jacques                                   50
 2 Dana                           East                                      64
 3 Cambridge                      McCormick                                 48
 4 Rachelle                       Martin                                    74
 5 Madison                        Newman                                    94
 6 Courtney                       Jones                                     43
 7 Mackenzie                      Webb                                      30
 8 Jessica                        Low                                       36
 9 Erin                           Phillips                                  76
10 Justine                        Mules                                     35
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_2)

tidy(m)
# A tibble: 2 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          41.3      5.32        7.75 2.31e-13
2 disposalEfficiency    0.165    0.0827      2.00 4.71e- 2

Again, we got a different set of coefficients. This is only the result of random chance.

Let’s go again:

afl_sample_3 <- afl_df |> 
  sample_n(250) |>
  select(player.player.player.givenName, 
         player.player.player.surname,
         dreamTeamPoints,
         disposalEfficiency)
afl_sample_3
# A tibble: 250 × 4
   player.player.player.givenName player.player.player.surname dreamTeamPoints
   <chr>                          <chr>                                  <dbl>
 1 Taylor                         Smith                                     47
 2 Alice                          O'Loughlin                                46
 3 Zarlie                         Goldsworthy                              103
 4 Chelsea                        Biddell                                   58
 5 Courtney                       Hodder                                    92
 6 Mim                            Strom                                     75
 7 Jade                           Ellenger                                  92
 8 Zoe                            Wakfer                                    32
 9 Isabel                         Huntington                                37
10 Brooke                         Brown                                     58
# ℹ 240 more rows
# ℹ 1 more variable: disposalEfficiency <dbl>
m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample_3)

tidy(m)
# A tibble: 2 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          42.3      5.40        7.84 1.35e-13
2 disposalEfficiency    0.126    0.0838      1.50 1.34e- 1

We can see these differences:

afl_sample_1 |> 
  mutate(sample = "One") |> 
  bind_rows(
    mutate(afl_sample_2, sample = "Two")
  ) |> 
  bind_rows(
    mutate(afl_sample_3, sample = "Three")
  ) |> 
  ggplot(aes(x = disposalEfficiency, y = dreamTeamPoints, colour = sample)) + 
  geom_point(alpha = 0.5) + 
  geom_smooth(method = "lm", se = F) + 
  theme_minimal()

Let’s do this many times:

afl_sample_regression <- function(i) {
  
  afl_sample <- sample_n(afl_df, 250)
  
  m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_sample)
  
  model_results <- tibble(round = i, model = tidy(m))
  
  return(model_results)
  
}

afl_regressions <- map(1:1000, afl_sample_regression, .progress = T) |> 
  bind_rows()

Now we can take a look at those different regression coefficients:

afl_regressions |> 
  unnest(model)
# A tibble: 2,000 × 6
   round term               estimate std.error statistic  p.value
   <int> <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
 1     1 (Intercept)         39.1       5.25       7.45  1.51e-12
 2     1 disposalEfficiency   0.172     0.0811     2.12  3.54e- 2
 3     2 (Intercept)         45.6       5.74       7.94  6.85e-14
 4     2 disposalEfficiency   0.0719    0.0871     0.826 4.10e- 1
 5     3 (Intercept)         43.9       5.90       7.44  1.68e-12
 6     3 disposalEfficiency   0.110     0.0912     1.21  2.28e- 1
 7     4 (Intercept)         37.8       5.35       7.08  1.51e-11
 8     4 disposalEfficiency   0.190     0.0829     2.29  2.29e- 2
 9     5 (Intercept)         42.2       5.69       7.43  1.80e-12
10     5 disposalEfficiency   0.140     0.0884     1.58  1.16e- 1
# ℹ 1,990 more rows

This shows the intercept and coefficient for the relationship between disposal efficiency and Dream Team points for each of the models we generated from our 1,000 different random samples. Remember, the only thing driving the differences between each model’s intercept and coefficient is random chance.

Let’s start with the disposal efficiency coefficients:

afl_regressions |> 
  unnest(model) |> 
  filter(term == "disposalEfficiency") |> 
  ggplot(aes(x = estimate)) + 
  stat_halfeye() + 
  theme_minimal() + 
  labs(caption = "The point shows the median. The black bars show the 66% and 95% intervals.")

Here is the distribution of the 1,000 disposal efficiency coefficients, one from each model.
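We can also summarise the spread of those simulated coefficients numerically:

afl_regressions |> 
  unnest(model) |> 
  filter(term == "disposalEfficiency") |> 
  summarise(mean_estimate = mean(estimate), 
            sd_estimate = sd(estimate))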

Next, we can look at the intercepts:

afl_regressions |> 
  unnest(model) |> 
  filter(term == "(Intercept)") |> 
  ggplot(aes(x = estimate)) + 
  stat_halfeye() + 
  theme_minimal() + 
  labs(caption = "The point shows the median. The black bars show the 66% and 95% intervals.")

You can see that, for both the coefficient and the intercept, one estimate is most common (represented by the peak of each density plot) and that the other estimates fall roughly symmetrically around it.

In fact, if we ran an infinite number of regressions from an infinite number of random samples from our data, these coefficients would be distributed following the t-distribution.
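We cannot actually run an infinite number of regressions, but we can check that our 1,000 simulated coefficients already look roughly bell-shaped. The sketch below overlays a normal curve (nearly identical to the t-distribution at these degrees of freedom) with the same mean and standard deviation as the simulated coefficients:

sim_coefs <- afl_regressions |> 
  unnest(model) |> 
  filter(term == "disposalEfficiency")

ggplot(sim_coefs, aes(x = estimate)) + 
  geom_density() + 
  # overlay a normal curve matched to the simulated coefficients
  stat_function(fun = dnorm, 
                args = list(mean = mean(sim_coefs$estimate), 
                            sd = sd(sim_coefs$estimate)), 
                colour = "red") + 
  theme_minimal()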

We can use this knowledge to focus back on that single regression model that we built right at the start of this session.

m <- lm(dreamTeamPoints ~ disposalEfficiency, data = afl_df)
tidy(m)
# A tibble: 2 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          42.3      4.57        9.26 1.59e-18
2 disposalEfficiency    0.133    0.0705      1.88 6.03e- 2

Building uncertainty into our single regression model

Generally, we have a single sample and we use that sample to fit our models. We still need to account for the uncertainty (demonstrated above) created by random chance. We know that from an infinite number of samples, we would fit linear models with coefficients that follow the t-distribution. So, let’s use that knowledge to work out how certain we are of our estimates found above.

We have our point estimates for our intercept and coefficient:

tidy(m) |> 
  select(term, estimate)
# A tibble: 2 × 2
  term               estimate
  <chr>                 <dbl>
1 (Intercept)          42.3  
2 disposalEfficiency    0.133

These are our best guess at the true linear relationship between disposal efficiency and Dream Team points. Our best guess is that players who have a disposal efficiency of zero have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.

Perfect. Let’s set these best guesses as our center points.

Now, we need to build out the plausible set of intercepts and coefficients that would result from an infinite number of random samples drawn from our population. To do this, we need to work out how spread out these intercepts and coefficients would be from their center points. Formally, this spread is called the standard deviation.

The standard deviation, \(s\), is calculated using two pieces of information. First, it looks at how well our line of best fit (our regression model) fits our observed data. How far are the predicted values (represented by the blue line on the below graph) from the observed values (represented by the black dots)?

ggplot(afl_df, aes(x = disposalEfficiency, y = dreamTeamPoints)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = F) + 
  theme_minimal()

We will have more uncertainty around our intercept and coefficient if the distance between the predicted values and observed values is large. This makes sense: if the predicted values and the observed values were very similar, we are unlikely to find a wildly different line of best fit from a different random sample from our population.

The standard deviation also takes into account the amount of information you have used to generate this intercept and coefficient. Larger sample sizes and fewer independent variables result in less uncertainty around those model coefficients.

In sum, the standard deviation is calculated as:

\[ s = \sqrt{\frac{\sum{e_i^2}}{n-k-1}} \]

Where \(e_i\) is the difference between each predicted value and observed value (the vertical distance between each point and the blue line on the graph above).

\(n-k-1\) is the degrees of freedom in the model. It accounts for the amount of information you used to build the model, with \(n\) equal to the number of observations you used and \(k\) equal to the number of independent variables you included. Here, our \(k=1\) because we are only using one independent variable: disposal efficiency.

We can use this spread to work out the standard errors around our coefficients. Broadly, the standard error translates this overall spread into the uncertainty around each individual estimate.

For the intercept, it is:

\[ SE_{\beta_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum{(x_i-\bar{x})^2}}} \]

For the coefficients, it is:

\[ SE_{\beta_1} = \frac{s}{\sqrt{\sum{(x_i-\bar{x})^2}}} \]

Tip

R will calculate all of this for you in the background. Do not worry about memorizing these formulas. I put them here to help build your intuition about what goes into your uncertainty around the relationship you find between your outcome of interest and the variables you think drive that outcome.
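That said, if you would like to convince yourself that these formulas reproduce the numbers R reports, here is a quick hand calculation using the model and data from above (a sketch only; you will never need to do this yourself):

x <- afl_df$disposalEfficiency
n <- nrow(afl_df)
k <- 1   # one independent variable: disposal efficiency

# standard deviation of the residuals
s <- sqrt(sum(residuals(m)^2) / (n - k - 1))

# standard errors for the intercept and the coefficient
se_intercept <- s * sqrt(1 / n + mean(x)^2 / sum((x - mean(x))^2))
se_slope <- s / sqrt(sum((x - mean(x))^2))

c(s = s, se_intercept = se_intercept, se_slope = se_slope)

These match the residual standard error reported by summary(m) and the std.error column below.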

Happily, broom::tidy() calculates all of this for us:

tidy(m) |> 
  select(term:std.error)
# A tibble: 2 × 3
  term               estimate std.error
  <chr>                 <dbl>     <dbl>
1 (Intercept)          42.3      4.57  
2 disposalEfficiency    0.133    0.0705

Okay, so we have a good sense of the spread around our best guess. Let’s visualize this:

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
            sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_halfeye() + 
  theme_minimal()

Note

There are two things to note here. The first is that I am using the normal distribution, not the t-distribution. Base R’s rt() does not have arguments for setting the center or spread of its random draws. So, to illustrate, I am using random draws from the normal distribution (done using rnorm()). At these large degrees of freedom, the two distributions are almost identical, so we can roll with this.

Second, I am pulling 1 million random draws from this distribution. In theory, we talk about the distribution of an infinite number of coefficients built from an infinite number of samples. I, obviously, cannot do this, so I use 1 million as my stand-in for infinity.
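If you did want t-shaped draws, one workaround is to shift and scale draws from the standard t-distribution by hand. With 376 degrees of freedom the result is visually indistinguishable from the normal draws used here:

b1 <- tidy(m) |> filter(term == "disposalEfficiency")

tibble(
  # shift and scale standard t draws to match our estimate and standard error
  x = b1$estimate + b1$std.error * rt(1e6, df = nrow(afl_df) - 2)
) |> 
  ggplot(aes(x = x)) + 
  stat_halfeye() + 
  theme_minimal()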

Above, we show the plausible set of coefficients describing the effect of a one percentage point increase in a player’s disposal efficiency on their Dream Team points, on average. Remember, our best guess (taken from our regression model) is that a one percentage point increase in efficiency is associated with a 0.13 increase in a player’s Dream Team points, on average. However, as we saw above when we built 1,000 different models from 1,000 different random samples, it is entirely plausible that we could get different coefficients from a model built from a different sample. These differences are the product of random chance. That’s fine! We just need to acknowledge that.

Let’s also visualize all of the plausible intercepts we could get:

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "(Intercept)") |> pull(estimate),
            sd = tidy(m) |> filter(term == "(Intercept)") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_halfeye() + 
  theme_minimal()

Again, our best guess is that players who have a disposal efficiency of 0 have, on average, 42.3 Dream Team points. This is one of many plausible intercepts that we could generate.

Connecting this to our research question

We used our data to uncover an interesting relationship between disposal efficiency and Dream Team points, on average. As your disposal efficiency increases, so too do your Dream Team points. We found that players who increased their disposal efficiency by one percentage point saw an increase of 0.13 Dream Team points, on average.

Now, imagine that you are a player and you want to increase your Dream Team points. Do you trust this? Do you really believe that by increasing your disposal efficiency you will increase your Dream Team points? What if, in fact, there is no relationship between these two variables?

We know from above that there are a range of plausible intercepts and coefficients that result from random chance. Does this plausible range include zero? In other words, is it plausible that there is no relationship between the outcome of interest (Dream Team points) and your independent variable (disposal efficiency)?

To answer this question, we draw on our knowledge of hypothesis testing.

First, we need to define what we mean by a plausible range of relationships. Traditionally, we are willing to accept a five percent risk that we declare that there is a relationship between an outcome and an independent variable when there is, in fact, no relationship. Let’s stick with that threshold.

Okay, so now we need to work out where the other 95 percent of plausible values sit. Let’s start with the coefficient describing the relationship between Dream Team points and disposal efficiency. Here is our representation of the coefficients drawn from an infinite number of random samples pulled from our population (it is the same as printed above):

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
            sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_halfeye() + 
  theme_minimal()

Where do 95 percent of these coefficients fall around our best guess? We can use our knowledge of the t-distribution to answer this question.

We know that center point (our coefficient):

beta_1 <- tidy(m) |> 
  filter(term == "disposalEfficiency") |> 
  pull(estimate)
beta_1
[1] 0.1328924

We know how spread out our alternative coefficients are from that center point (our standard error):

se_beta_1 <- tidy(m) |> 
  filter(term == "disposalEfficiency") |> 
  pull(std.error)
se_beta_1
[1] 0.0705205

And we know our threshold (5%). We need to translate that threshold into its t-statistic, accounting for the degrees of freedom we have access to:

sample_size <- nrow(afl_df)
sample_size
[1] 378
no_of_IVs <- 1
no_of_IVs
[1] 1
df <- sample_size - no_of_IVs - 1
df
[1] 376

Therefore:

t_stat_95 <- qt(0.025, df = df, lower.tail = F)
t_stat_95
[1] 1.966293

This is the point beyond which 2.5% of all values along the t-distribution fall.
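We can confirm this by feeding that value back through pt(), which returns the proportion of the t-distribution that sits beyond it:

pt(t_stat_95, df = df, lower.tail = F)
[1] 0.025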

Tip

Remember, we want to find out where 95 percent of these alternative coefficients sit around the center point. So, we need to distribute our remaining five percent between the two tails. That’s why we find the t-statistic beyond which 2.5% of the data fall.

Now we can find the boundaries within which 95 percent of these alternative coefficients fall:

Note

Remember:

\[ CI = \beta_1 \pm t*SE_{\beta_1} \]

lower_ci <- beta_1 - t_stat_95*se_beta_1
lower_ci
[1] -0.005771574
upper_ci <- beta_1 + t_stat_95*se_beta_1
upper_ci
[1] 0.2715564

Let’s place those in context:

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
            sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) + 
  theme_minimal() + 
  theme(legend.position = "none")

Brilliant! Now we know where 95 percent of all alternative coefficients drawn from a random sample of our population could fall. These are our plausible alternative coefficients.

So, do these plausible alternatives include zero? In other words, is it plausible that there is no relationship between a player’s disposal efficiency and their Dream Team points?

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
            sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) + 
  geom_vline(xintercept = 0) + 
  theme_minimal() + 
  theme(legend.position = "none")

Yes! A null relationship sits within our plausible set of coefficients for disposal efficiency. We cannot reject the notion that there is no relationship between disposal efficiency and Dream Team points at this threshold.

What about a more forgiving threshold? What if we are willing to accept a 10 percent chance that we reject a true null relationship?

First, we need to find this new threshold’s t-statistic:

t_stat_90 <- qt(0.05, df = df, lower.tail = F)
t_stat_90
[1] 1.648916

And, therefore, our new boundaries (within which 90 percent of alternative coefficients sit):

lower_ci <- beta_1 - t_stat_90*se_beta_1
lower_ci
[1] 0.01661001
upper_ci <- beta_1 + t_stat_90*se_beta_1
upper_ci
[1] 0.2491748
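As a check, broom::tidy() will report these 90 percent boundaries directly if we ask for a 90 percent confidence interval:

tidy(m, conf.int = T, conf.level = 0.9)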

Do these contain a null relationship?

tibble(
  x = rnorm(1e6, 
            mean = tidy(m) |> filter(term == "disposalEfficiency") |> pull(estimate),
            sd = tidy(m) |> filter(term == "disposalEfficiency") |> pull(std.error))
) |> 
  ggplot(aes(x = x)) + 
  stat_slab(aes(fill_ramp = after_stat(x < lower_ci | x > upper_ci))) + 
  geom_vline(xintercept = 0) + 
  theme_minimal() + 
  theme(legend.position = "none")

No! So, if we are happy to accept a 10 percent risk that we will reject a true null relationship, we can reject the null hypothesis that there is no relationship between disposal efficiency and Dream Team points. We can tell the players that we have found a statistically significant relationship between disposal efficiency and Dream Team points at the 0.1 threshold.

We can approach this question from the other direction. If the null hypothesis were true, how likely would we be to see our estimate?

First, let’s set up our null world:

tibble(
  x = rt(1e6, df = df)
) |> 
  ggplot(aes(x = x)) + 
  stat_slab() + 
  theme_minimal() + 
  theme(legend.position = "none")

Note

I am back to random draws from our t-distribution.

Here, we have centered our distribution of alternative coefficients (resulting only from differences in our samples brought about by random chance) at zero. We are in the null world: there is no relationship between disposal efficiency and Dream Team points.

Where does the estimate we found in our sample sit within this null world? First, we need to translate that observed estimate into its t-statistic:

\[ t = \frac{\beta_1}{SE_{\beta_1}} \]

t_stat <- beta_1 / se_beta_1
t_stat
[1] 1.884451

Let’s place this in the context of the null world:

tibble(
  x = rt(1e6, df = df)
) |> 
  ggplot(aes(x = x)) + 
  stat_slab() + 
  geom_vline(xintercept = t_stat) + 
  theme_minimal() + 
  theme(legend.position = "none")

How likely is it that I would get this coefficient or a more extreme coefficient if we did, in fact, live in the null world? In other words, what proportion of these alternative coefficients (highlighted in dark gray on the graph below) are at least as extreme as our observed estimate?

Tip

Remember that we are conducting a two-tailed test of our null hypothesis. We need to be open to the estimate being greater or smaller than the null.

tibble(
  x = rt(1e6, df = df)
) |> 
  ggplot(aes(x = x)) + 
  stat_slab(aes(fill_ramp = after_stat(x < -t_stat | x > t_stat))) + 
  geom_vline(xintercept = t_stat) + 
  theme_minimal() + 
  theme(legend.position = "none")

The proportion of alternative coefficients that are equal to or more extreme than the one we observed is:

p_value <- 2 * pt(t_stat, df = df, lower.tail = F)
p_value
[1] 0.06027518

We would observe an estimate of 0.133 or a more extreme estimate 6% of the time if the null hypothesis were true. When we are only happy to accept a five percent chance that we would reject a true null hypothesis, we cannot reject that null hypothesis. If, on the other hand, we are happy to accept a 10 percent chance that we reject a true null hypothesis, we can reject it.

Reading our regression outputs

Let’s return to our original model:

tidy(m)
# A tibble: 2 × 5
  term               estimate std.error statistic  p.value
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)          42.3      4.57        9.26 1.59e-18
2 disposalEfficiency    0.133    0.0705      1.88 6.03e- 2

We now have information on every component of this output.

We can translate these estimates, in the estimate column. Our model suggests that players that have a disposal efficiency of 0 (none of their disposals result in a good outcome for their team) have, on average, 42.3 Dream Team points. Every one percentage point increase in a player’s efficiency is associated with a gain of 0.13 Dream Team points, on average.

We know that these estimates are our best guess of the true linear relationship between Dream Team points and disposal efficiency. Our best guess may be different from the true relationship because of random chance. How confident are we of that best guess? Well, we know from the standard error (std.error) how spread out the alternative coefficients sit around that best guess.

We know where our estimate (translated into its t-statistic, or statistic) sits within the null world.

And finally, we know the probability (p.value) that we would observe an estimate at least as extreme as the one we found if the true coefficient were actually equal to zero.

All of that work we did above is replicated here, in this one line of code. In fact, we can also add our confidence intervals around our estimate to the broom::tidy() output:

tidy(m, conf.int = T)
# A tibble: 2 × 7
  term               estimate std.error statistic  p.value conf.low conf.high
  <chr>                 <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)          42.3      4.57        9.26 1.59e-18 33.3        51.3  
2 disposalEfficiency    0.133    0.0705      1.88 6.03e- 2 -0.00577     0.272

These are exactly as we manually calculated above.
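The model-level statistics that appear in the published tables below (R², the F-statistic, AIC, and so on) are also available through broom::glance():

glance(m)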

What would this all look like in a published article?

modelsummary(m,
             coef_rename = c("disposalEfficiency" = "Disposal efficiency"),
             statistic = c("std.error", "p.value"))
                      (1)
(Intercept)           42.329
                      (4.571)
                      (<0.001)
Disposal efficiency   0.133
                      (0.071)
                      (0.060)
Num.Obs.              378
R2                    0.009
R2 Adj.               0.007
AIC                   3510.0
BIC                   3521.8
Log.Lik.              −1751.992
F                     3.551
RMSE                  24.93

Increasingly, people are including confidence intervals:

modelsummary(m,
             coef_rename = c("disposalEfficiency" = "Disposal efficiency"),
             statistic = c("conf.int", "p.value"))
                      (1)
(Intercept)           42.329
                      [33.340, 51.317]
                      (<0.001)
Disposal efficiency   0.133
                      [−0.006, 0.272]
                      (0.060)
Num.Obs.              378
R2                    0.009
R2 Adj.               0.007
AIC                   3510.0
BIC                   3521.8
Log.Lik.              −1751.992
F                     3.551
RMSE                  24.93

Or presenting their regression results graphically:

modelplot(m,
          coef_rename = c("disposalEfficiency" = "Disposal efficiency"))