Descriptive Statistics

Set up

Throughout this course, you will need a series of data sets I have collected, cleaned, and stored in the polisciols R package. These data sets were collected and published by political scientists (including some incredible GVPT alumni).

This package is not published on CRAN1, so you will need to install it using the following code:

install.packages("devtools")

devtools::install_github("hgoers/polisciols")

Remember, you only need to do this once on your computer. Run this in the console.

You will also need access to the following R packages to complete this week’s activities: tidyverse, wbstats, janitor, skimr, countrycode, and scales.

If you have not already done so, please install or update these packages by running the following in your console:

install.packages(c("tidyverse",
                   "wbstats",
                   "janitor",
                   "skimr",
                   "countrycode",
                   "scales"))

How will what you learn this week help your research?

You have an interesting question that you want to explore. You have some data that relate to that question. Included in these data are information on your outcome of interest and information on the things that you think determine or shape that outcome. You think that one (or more) of the drivers is particularly important, but no one has yet written about it or proven its importance. Brilliant! What do you do now?

The first step in any empirical analysis is getting to know your data. I mean, really getting to know your data. You want to dig into it with a critical eye. You want to understand any patterns lurking beneath the surface.

Ultimately, you want to get a really good understanding of the data generation process. This process can be thought of in two different and important ways. First, you want to understand how, out there in the real world, your outcome and drivers come to be. For example, if you are interested in voting patterns, you want to know the nitty gritty process of how people actually vote. Do they have to travel long distances, stand in long queues, fill out a lot of paperwork? Are there age restrictions on their ability to vote? Are there more insidious restrictions that might suppress voting for one particular group in the electorate?

You can use the skills we will discuss in this section to help you answer these questions. For example, you can determine whether there are relatively few young voters compared to older voters. If so, why? In turn, your growing expertise in and understanding of the data generation process should inform your exploration of the data. You might note that people have to wait in long queues on a Tuesday to vote. Does this impact the number of workers vs. retirees who vote?

Now, this is made slightly more tricky by the second part of this process. You need to understand how your variables are actually measured. How do we know who turns out to vote? Did you get access to the voter file, which records each individual who voted and some interesting and potentially relevant demographic information about them? Or are you relying on exit polls, which only include a portion of those who voted? Were the people included in the polls reflective of the total voting population? Who or what is missing from this survey? Of course, if your sample is not representative, you might find some patterns that appear to be very important to your outcome of interest but are, in fact, just an artifact of a poorly drawn sample. If your survey failed to get responses from young people, you may be led to falsely believe that young people don’t vote.

This week you will be introduced to the first part of the data analysis process: data exploration. We use descriptive statistics to describe patterns in our data. These are incredibly powerful tools that will arm you with an intimate knowledge of the shape of your variables of interest. With this knowledge, you will be able to start to answer your important question and potentially identify new ones. You will also be able to sense-check your more complex models and pick up on odd or incorrect relationships that they may find.

As you make your frequency tables and histograms and very elaborate dot plots and box charts, keep in mind that these tools are useful for your interrogation of the data generation process. Be critical. Continue to ask whether your data allow you to detect true relationships between your variables of interest. Build your intuition for what is really going on and what factors are really driving your outcome of interest.

Let’s get started.

Describing your data

Broadly, there are two types of variables: categorical and continuous.

Categorical variables are discrete. They can be unordered (nominal) - for example, the different colours of cars - or ordered (ordinal) - for example, whether you strongly dislike, dislike, are neutral about, like, or strongly like Taylor Swift.

Note

Dichotomous (or binary) variables are a special type of categorical variable. They take on one of two values. For example: yes or no; at war or not at war; is a Swifty, or is not a Swifty.

Continuous variables are, well, continuous. For example, your height or weight, a country’s GDP or population, or the number of fatalities in a battle.

Note

Continuous variables can be made into (usually ordered) categorical variables. This process is called binning. For example, you can take individuals’ ages and reduce them to non-overlapping brackets: 0 - 17 years old, 18 - 44 years old, 45 - 64 years old, and 65+ years old.

You lose information in this process: you cannot go from 45 - 64 years old back to the individuals’ precise ages. In other words, you cannot go from a categorical to a continuous variable.
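
In R, one way to do this binning is with base R’s cut(). A quick sketch using some made-up ages:

# Hypothetical ages
ages <- c(16, 22, 34, 47, 58, 63, 71, 80)

# Bin them into the ordered categories described above
cut(
  ages,
  breaks = c(0, 17, 44, 64, Inf),
  labels = c("0-17", "18-44", "45-64", "65+"),
  include.lowest = TRUE,
  ordered_result = TRUE
)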

Let’s take a look at how you can describe these different types of variables using real-world political science examples.

Describing categorical variables

Generally, we can get a good sense of a categorical variable by looking at counts or proportions. For example, which category contains the largest number of observations? Which contains the smallest?

Note

Later, we will ask interesting questions using these summaries. These include whether differences between the counts and/or percentages of cases that fall into each category are meaningfully (and/or statistically significantly) different from one another. This deceptively simple question serves as the foundation for a lot of political science research.

Let’s use the American National Election Studies (ANES) survey to explore how to produce useful descriptive statistics for categorical variables using R. The ANES surveys individual US voters prior to and just after US Presidential Elections, asking about their political beliefs and behavior.

We can access the latest survey (from the 2020 Presidential Election) using the polisciols package:

polisciols::nes
Exercise

Take a look at the different pieces of information collected about each respondent by running ?nes in your console.

Let’s look at US voters’ views on income inequality in the US. Specifically, we will look at whether individuals think the difference in incomes between rich people and poor people in the United States today is larger, smaller, or about the same as it was 20 years ago.

Respondents could provide one of four answers (or refuse to answer the question):

distinct(nes, income_gap)
# A tibble: 5 × 1
  income_gap    
  <ord>         
1 About the same
2 Larger        
3 Smaller       
4 <NA>          
5 Don't know    

This is an ordinal categorical variable. It is discrete and ordered. We can take a look at the variable itself using the helpful skimr::skim() function:

skim(nes$income_gap)
Data summary
Name nes$income_gap
Number of rows 8280
Number of columns 1
_______________________
Column type frequency:
factor 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
data 54 0.99 TRUE 4 Lar: 6117, Abo: 1683, Sma: 416, Don: 10

From this, we learn that:

  • The variable type is a factor (see the R tip below)
  • We are missing 54 observations (in other words, 54 people did not answer the question)
  • This means that we have information on 99% of our observations (from complete_rate).
Tip

Remember, there are many different types of data that R recognizes. These include characters ("A", "B", "C"), integers (1, 2, 3), and logical values (TRUE or FALSE). R treats categorical variables as factors.

Frequency distribution

We can take advantage of janitor::tabyl() to quickly calculate the number and proportion of respondents who gave each answer.

tabyl(nes, income_gap)
     income_gap    n     percent valid_percent
     Don't know   10 0.001207729   0.001215658
        Smaller  416 0.050241546   0.050571359
 About the same 1683 0.203260870   0.204595186
         Larger 6117 0.738768116   0.743617797
           <NA>   54 0.006521739            NA
Tip

valid_percent gives the proportion of respondents who gave each answer, with missing values removed from the denominator. For example, the ANES surveyed 8,280 respondents in 2020, but only 8,226 of them answered this question.

6,117 responded that they believe the income gap is larger today than it was 20 years ago. Therefore, the Larger proportion (which is bounded by 0 and 1, whereas percentages are bounded by 0 and 100) is 6,117 / 8,280 ≈ 0.739 and its valid proportion is 6,117 / 8,226 ≈ 0.744.
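
If you want to stay within the tidyverse, you can reproduce these columns yourself. This is just a sketch using dplyr’s count() and mutate(), not how janitor computes them:

nes |> 
  count(income_gap) |> 
  mutate(
    # The denominator includes every respondent, even those who did not answer
    percent = n / sum(n),
    # The denominator drops missing responses; the NA row itself stays NA
    valid_percent = if_else(
      is.na(income_gap), NA_real_, n / sum(n[!is.na(income_gap)])
    )
  )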

Visualizing this frequency

It is a bit difficult to quickly determine relative counts. Which was the most popular answer? Which was the least? Are these counts very different from each other?

Visualizing your data will give you a much better sense of it. I recommend using a bar chart to clearly show relative counts.

ggplot(nes, aes(y = income_gap)) + 
  geom_bar() +
  theme_minimal() + 
  theme(plot.title = element_text(face = "bold"),
        plot.title.position = "plot") + 
  labs(
    title = "Do you think the difference in incomes between rich people and poor people in the United States today is larger, \nsmaller, or about the same as it was 20 years ago?", 
    x = "Count of respondents",
    y = NULL,
    caption = "Source: ANES 2020 Survey"
  ) + 
  scale_x_continuous(labels = scales::label_comma())

Tip

geom_bar() automatically counts the number of observations in each category.

From this plot we quickly learn that a large majority of respondents believe that the income gap has grown over the last 20 years. Very few people believe it has shrunk.

Describing continuous variables

We need to treat continuous variables differently from categorical ones. Individual values of a continuous variable cannot meaningfully be counted and compared in the way categories can. For example, imagine making a frequency table or bar chart that counts the number of countries with each observed GDP. You would have 193 different counts of one. Not very helpful!

We can get a much better sense of our continuous variables by looking at how they are distributed across the range of all possible values they could take on. Phew! Let’s make sense of this using some real-world data.

For this section, we will look at how much each country spends on education as a proportion of its gross domestic product (GDP). We will use wbstats::wb_data() to collect these data.

perc_edu <- wb_data(
  "SE.XPD.TOTL.GD.ZS", start_date = 2020, end_date = 2020, return_wide = F
) |> 
  transmute(
    country, 
    region = countrycode(country, "country.name", "region"),
    year = date,
    value
  )

perc_edu
# A tibble: 217 × 4
   country             region                      year value
   <chr>               <chr>                      <dbl> <dbl>
 1 Afghanistan         South Asia                  2020 NA   
 2 Albania             Europe & Central Asia       2020  3.34
 3 Algeria             Middle East & North Africa  2020  6.19
 4 American Samoa      East Asia & Pacific         2020 NA   
 5 Andorra             Europe & Central Asia       2020  2.63
 6 Angola              Sub-Saharan Africa          2020  2.67
 7 Antigua and Barbuda Latin America & Caribbean   2020  2.99
 8 Argentina           Latin America & Caribbean   2020  5.28
 9 Armenia             Europe & Central Asia       2020  2.71
10 Aruba               Latin America & Caribbean   2020 NA   
# ℹ 207 more rows
Note

I have added each country’s region (using countrycode::countrycode()) so that we can explore regional trends in our data.

We can get a good sense of how expenditure varied by country by looking at the center, spread, and shape of the distribution.

Visualizing continuous distributions

First, let’s plot each country’s spending to see how they relate to one another. There are two plot types commonly used for this: histograms and density curves.

Histograms

A histogram creates buckets along the range of values our variable can take (e.g. buckets of 10 between 1 and 100 would include 1 - 10, 11 - 20, 21 - 30, and so on). It then counts the number of observations that fall into each of those buckets and plots that count.

Let’s plot our data as a histogram with a bin width of 1 percentage point:

ggplot(perc_edu, aes(x = value)) + 
  geom_histogram(binwidth = 1) + 
  theme_minimal() + 
  labs(
    x = "Expenditure on education as a proportion of GDP",
    y = "Number of countries"
  )

From this we learn that most countries spend between three and five percent of their GDP on education. There appears to be an outlier: a country that spends around 10 percent of its GDP on education.

If we pick a narrower bin width, we will see more fine-grained detail about the distribution of our data:

ggplot(perc_edu, aes(x = value)) + 
  geom_histogram(binwidth = 0.25) + 
  theme_minimal() + 
  labs(
    x = "Expenditure on education as a proportion of GDP",
    y = "Number of countries"
  )

From this we learn that most countries spend around four percent of their GDP on education. There is a small cluster of countries that spend between roughly 7.5 and nine percent on these services.

Density curves

Density curves also communicate the distribution of continuous variables. They plot the density of the data that fall at a given value on the x-axis.

Let’s plot our data using a density plot:

ggplot(perc_edu, aes(x = value)) + 
  geom_density() + 
  theme_minimal() + 
  labs(
    x = "Expenditure on education as a proportion of GDP",
    y = "Density"
  )

This provides us with the same information as above, but highlights the broader shape of our distribution. We again learn that most countries spend around four percent of their GDP on education. There are some that spend above 7.5 percent.

Understanding distributions

We can use the shape of a variable’s distribution to compare our variables of interest to other variables. Is the distribution symmetric or skewed? Where are the majority of observations clustered? Are there multiple distinct clusters, or high points, in the distribution?

There are three broad distributions that you should know: normal, right-skewed, and left-skewed. People use these terms to summarize the shape of their continuous data.

Normal distribution

A normally distributed variable includes values that fall symmetrically away from their center point, which is the peak (or most common value).

Examples of normally distributed data include the height or weight of all individuals in a large population.

Note

This distribution is also referred to as a bell-curve.

Right-skewed distribution

With right-skewed data, the majority of observations have relatively small values, with a small number of much larger values stretching out the right tail.

Examples of right-skewed data include countries’ GDP.

Left-skewed distribution

With left-skewed data, the majority of observations have relatively large values, with a small number of much smaller values stretching out the left tail.

Examples of left-skewed data include democracies’ election turnout rates.
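
If you would like to see these three shapes for yourself, you can simulate them. This is just an illustrative sketch (rexp() is one convenient way to generate skewed values; these are not course data):

set.seed(111)

shapes <- tibble(
  normal = rnorm(10000, mean = 0, sd = 1),   # symmetric bell curve
  right_skewed = rexp(10000, rate = 1),      # long right tail
  left_skewed = -rexp(10000, rate = 1)       # long left tail
) |> 
  pivot_longer(everything(), names_to = "shape", values_to = "value")

ggplot(shapes, aes(x = value)) + 
  geom_density() + 
  facet_wrap(~ shape, scales = "free") + 
  theme_minimal() + 
  labs(x = NULL, y = "Density")

You can also compute the mean and median of each simulated variable to preview the relationship between skew and central tendency discussed below.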

Measures of central tendency: mean, median, and mode

We can also use measures of central tendency to quickly describe and compare our variables.

Mean

The mean is the average of all values. Formally:

\[ \bar{x} = \frac{\Sigma x_i}{n} \]

In other words, add all of your values together and then divide that total by the number of values you have.

In R:

mean(perc_edu$value, na.rm = T)
[1] 4.509317
Tip

If you do not set the argument na.rm (read: “NA, remove!”) to TRUE, you will get NA back if there are any missing values in your vector. This is a good default! You should be very aware of missing data points.

On average, countries spent 4.51% of their GDP on education in 2020.
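
To connect the formula to the function, you can reproduce this value by hand (just a check; in practice you would use mean()):

# Sum of the non-missing values divided by the number of non-missing values
sum(perc_edu$value, na.rm = TRUE) / sum(!is.na(perc_edu$value))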

Median

The median is the mid-point of all values.

To calculate it, put all of your values in order from smallest to largest and identify the value in the middle. (If you have an even number of values, take the average of the two middle values.) That’s your median.

In R:

median(perc_edu$value, na.rm = T)
[1] 4.446204

The median country spent 4.45% of their GDP on education in 2020.

Mode

The mode is the most frequent of all values.

To calculate it, count how many times each value occurs in your data set. The one that occurs the most is your mode.

Note

This is usually a more useful summary statistic for categorical variables than continuous ones. For example, which colour of car is most popular? Which political party has the most members?

In R:

x <- c(1, 1, 2, 4, 5, 32, 5, 1, 10, 3, 4, 6, 10)

table(x)
x
 1  2  3  4  5  6 10 32 
 3  1  1  2  2  1  2  1 
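
Base R does not have a function for the statistical mode (confusingly, mode() returns an object’s storage type), but you can read it straight off the table. A small sketch:

counts <- table(x)

# The most frequent value is the mode: here it is 1, which appears three times
names(counts)[which.max(counts)]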

Using central tendency to describe and understand distributions

Normally distributed values have the same mean and median.

For right-skewed data, the mean is greater than the median.

For left-skewed data, the mean is smaller than the median.

Note

When do we care about the mean or the median? There is no simple answer to this question. Both of these values are useful summaries of our continuous data. By default, we use the average to describe our data in statistical analysis. As you will learn, most regression models are, fundamentally, just fancy averages of our data. However, this default is not always sensible.

As you may have noted above, the average value is more sensitive to extreme values. If you have one very large or very small number in your vector of numbers, your average will be pulled well away from your mid-point (or median). This can lead you astray. To illustrate, let’s look at the average and median of the numbers between one and 10:

x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
mean(x)
[1] 5.5
median(x)
[1] 5.5

If we add one very large number to our vector, our average will shoot up but our median will only move up one additional number in our collection:

x <- c(x, 1000)
x
 [1]    1    2    3    4    5    6    7    8    9   10 1000
mean(x)
[1] 95.90909
median(x)
[1] 6

Which number better summarizes our data? Here, I would suggest that the average is misleading. That one 1,000 data point is doing a lot of the work. The median better describes the majority of my data.

We will talk more about this (and outliers more specifically) throughout the semester.

Five number summary

As you can see, we are attempting to summarize our continuous data to give us a meaningful but manageable sense of it. Means and medians are useful for continuous data.

We can provide more context to our understanding using more summary statistics. A common approach is the five number summary. This includes:

  • The smallest value;

  • The 25th percentile value, or the median of the lower half of the data;

  • The median;

  • The 75th percentile value, or the median of the upper half of the data;

  • The largest value.
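
Base R will report these numbers for you. For example, using the education spending data from above:

# The five-number summary, plus the mean and a count of missing values
summary(perc_edu$value)

# Or just the quartiles (the 0th, 25th, 50th, 75th, and 100th percentiles)
quantile(perc_edu$value, na.rm = TRUE)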

We can use skimr::skim() to quickly get useful information about our continuous variable.

skim(perc_edu$value)
Data summary
Name perc_edu$value
Number of rows 217
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 52 0.76 4.51 1.75 0.36 3.3 4.45 5.53 10.54 ▂▇▇▂▁

We have 217 rows (because our unit of observation is a country, we can read this as 217 countries2). We are missing education spending values for 52 of those countries (see n_missing), giving us a complete rate of 76% (see complete_rate).

The country that spent the least on education as a percent of its GDP in 2020 was Nigeria, which spent about 0.4% (see p0). The country that spent the most was Micronesia, Fed. Sts., which spent 10.5% (see p100). The average percent of GDP spent on education in 2020 was 4.5% (see mean) and the median was 4.4% (see p50).

This description was a bit unwieldy. As usual, to get a better sense of our data we should visualize it.

Box plots

Box plots (sometimes referred to as box and whisker plots) visualize the five number summary (with bonus features) nicely.

ggplot(perc_edu, aes(x = value)) + 
  geom_boxplot() + 
  theme_minimal() + 
  theme(
    axis.text.y = element_blank()
  ) + 
  labs(
    x = "Expenditure on education as a proportion of GDP",
    y = NULL
  )

The box in the graph above displays the 25th percentile, the median, and the 75th percentile values. The whiskers extend out to the most extreme values that fall within 1.5 times the interquartile range (IQR) of the edges of the box. The IQR is the 75th percentile minus the 25th percentile: the width of the box itself. If the smallest or largest values fall within that range, the whiskers simply end at those values. Any remaining data points are displayed as dots beyond the whiskers of our box and whisker plot.
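
To make that rule concrete, here is a small sketch of the 1.5 × IQR “fences” that geom_boxplot() uses by default when deciding which points to draw individually:

q1 <- quantile(perc_edu$value, 0.25, na.rm = TRUE) |> unname()
q3 <- quantile(perc_edu$value, 0.75, na.rm = TRUE) |> unname()
iqr <- q3 - q1

# Points below the lower fence or above the upper fence are drawn as dots
c(lower_fence = q1 - 1.5 * iqr, upper_fence = q3 + 1.5 * iqr)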

Outliers

Note that some countries’ expenditures are displayed as dots. The box plot above provides a bit more information than the five number summary alone. Values that fall more than 1.5 times the IQR beyond the edges of the box are displayed as dots. These are (very rule of thumb, take with a grain of salt, please rely on your theory and data generation process instead!) candidates for outliers.

Outliers fall so far away from the majority of the other values that they should be examined closely and perhaps excluded from your analysis. As discussed above, they can distort your mean. They do not, however, distort your median.

Note

We will talk more about how to deal with outliers later in the course.

Measures of spread: range, variance, and standard deviation

We now have a good sense of some of the features of our data. Another useful thing to know is how spread out our values are around their center. Here, measures of spread come in handy.

Range

The range is the difference between the largest and smallest value.

\[ range = max - min \]

max(perc_edu$value, na.rm = T) - min(perc_edu$value, na.rm = T)
[1] 10.17936

The difference between the country that spends the highest proportion of its GDP on education and that which spends the least is 10.18 percentage points.

Variance

The variance measures how spread out your values are. On average, how far are your observations from the mean?

This measure can, at first, be a bit too abstract to get an immediate handle on. Let’s walk through it. Imagine we have two data sets, wide_dist and narrow_dist. Both are normally distributed, share the same mean (0), and the same number of observations (1,000,000).
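
These data sets are simulated. Something like the following would produce them (a sketch: the seed here is made up, so your exact values will differ, but the variances computed below suggest standard deviations of roughly 2 and 1):

set.seed(222)  # hypothetical seed, for reproducibility

wide_dist <- tibble(x = rnorm(1e6, mean = 0, sd = 2))    # more spread out around 0
narrow_dist <- tibble(x = rnorm(1e6, mean = 0, sd = 1))  # tightly clustered around 0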

wide_dist
# A tibble: 1,000,000 × 1
        x
    <dbl>
 1  1.47 
 2  4.53 
 3 -2.08 
 4 -0.715
 5 -4.01 
 6 -2.50 
 7  0.465
 8 -0.251
 9 -0.700
10 -0.529
# ℹ 999,990 more rows
narrow_dist
# A tibble: 1,000,000 × 1
         x
     <dbl>
 1  0.204 
 2 -0.283 
 3 -0.597 
 4 -0.0543
 5 -0.533 
 6  1.13  
 7  1.15  
 8 -0.238 
 9  0.545 
10 -0.517 
# ℹ 999,990 more rows

Let’s plot them:
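
The figure code is not shown here, but something like this sketch would produce the two panels described below, with wide_dist on top:

bind_rows(
  wide_dist |> mutate(dist = "wide_dist"),
  narrow_dist |> mutate(dist = "narrow_dist")
) |> 
  mutate(dist = factor(dist, levels = c("wide_dist", "narrow_dist"))) |> 
  ggplot(aes(x = x)) + 
  geom_histogram(binwidth = 0.25) + 
  facet_wrap(~ dist, ncol = 1) + 
  theme_minimal() + 
  labs(x = NULL, y = "Count")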

Despite both having the same center point and number of observations, the data are much more spread out around that center point in the top graph (of wide_dist).

The data in the top graph have higher variance (are more spread out) than those in the bottom graph. We measure this by summing the squared deviations of the observations from their mean and dividing by one less than the number of observations:

\[ s^2 = \frac{\Sigma(x_i - \bar{x})^2}{n - 1} \]

Let’s step through this. We will first calculate the variance of wide_dist. To do this:

  1. Calculate the mean of your values.

  2. Calculate the difference between each individual value and that mean (how far from the mean is every value?).

  3. Square those differences.

Tip

We do not care whether the value is higher or lower than the mean. We only care how far from the mean it is. Squaring a value removes its sign (positive or negative). Remember, if you multiply a negative number by a negative number, you get a positive number. This allows us to concentrate on the difference between each individual data point and the mean.

  4. Add all of those squared differences to get a single number.

  5. Divide that single number by the number of observations you have minus 1.

You now have your variance!

In R:

wide_dist_mean <- mean(wide_dist$x)

wide_var_calc <- wide_dist |> 
  mutate(
    mean = wide_dist_mean,
    diff = x - mean,
    diff_2 = diff^2
  )

wide_var_calc
# A tibble: 1,000,000 × 4
        x     mean   diff  diff_2
    <dbl>    <dbl>  <dbl>   <dbl>
 1  1.47  -0.00229  1.47   2.16  
 2  4.53  -0.00229  4.53  20.5   
 3 -2.08  -0.00229 -2.08   4.33  
 4 -0.715 -0.00229 -0.712  0.507 
 5 -4.01  -0.00229 -4.01  16.0   
 6 -2.50  -0.00229 -2.49   6.22  
 7  0.465 -0.00229  0.467  0.218 
 8 -0.251 -0.00229 -0.249  0.0620
 9 -0.700 -0.00229 -0.698  0.487 
10 -0.529 -0.00229 -0.527  0.278 
# ℹ 999,990 more rows

We then add those squared differences between each observation and the mean of our whole sample together. Finally, we divide that sum by one less than our number of observations.

wide_var <- sum(wide_var_calc$diff_2) / (nrow(wide_var_calc) - 1)

wide_var
[1] 4.000054

We can compare this to the variance for our narrower distribution.

narrow_var_calc <- narrow_dist |> 
  mutate(
    mean = mean(narrow_dist$x),
    diff = x - mean,
    diff_2 = diff^2
  )

narrow_var <- sum(narrow_var_calc$diff_2) / (nrow(narrow_var_calc) - 1)

narrow_var
[1] 1.001548

It is, in fact, smaller!

That was painful. Happily we can use var() to do this in one step:

var(wide_dist)
         x
x 4.000054
var(narrow_dist)
         x
x 1.001548
var(wide_dist) > var(narrow_dist)
     x
x TRUE
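
Returning to our education spending data, we can compute its variance in the same way (remembering to remove the missing values):

var(perc_edu$value, na.rm = TRUE)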

The variance of the percentage of GDP that countries spent on education in 2020 was roughly 3.06. Notice that this number is in squared units (squared percentage points), which makes it hard to interpret on its own. That is where the standard deviation comes in.

Standard deviation

A more easily interpreted measure of spread is the standard deviation. It is simply the square root of the variance, which puts the measure of spread back into the same units as your original data.

sqrt(wide_var)
[1] 2.000014
sqrt(narrow_var)
[1] 1.000774

You can get this directly using sd():

sd(wide_dist$x)
[1] 2.000014
sd(narrow_dist$x)
[1] 1.000774
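
And, again, for our education spending data:

sd(perc_edu$value, na.rm = TRUE)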

The standard deviation of the percentage of GDP that countries spent on education in 2020 was 1.75 percentage points. This horrible sentence demonstrates that standard deviations are most usefully employed in contexts other than attempts to better understand a single variable of interest. They are very important for determining how certain we can be about the relationships between different variables that we uncover using statistical models (which we will get to later in the semester).

Conclusion

Your empirical analysis is only as strong as its foundation. You can use the tools you learnt this week to build a very strong foundation. Always start any analysis by getting a very strong sense of your data. Look at it with a critical eye. Does it match your intuition? Is something off? What can you learn about the peaks and troughs among your observations?

Footnotes

  1. The Comprehensive R Archive Network (CRAN) hosts many R packages that can be installed easily using the familiar install.packages() function. These packages have gone through a comprehensive quality assurance process. I wrote polisciols for this class and will update it regularly. I, therefore, will not host it through CRAN: the quality assurance process takes too long to be practical for our weekly schedule. Instead, you are downloading it directly from its Github repository.↩︎

  2. You are right: there were not nrow(perc_edu) countries in 2020. The World Bank collects data on some countries that are not members of the UN (and would not, traditionally, be considered to be countries).↩︎