GVPT Maths Camp

Learning objectives

Import your data
Clean your data
Explore relational data
Start to manage your scripts with GitHub

A familiar problem

A solution

Track changes to your documents, code, or data over time
Work from one document
Have access to your work from anywhere
Create safe points in case something breaks or you want to experiment

Git and GitHub

Git: open source version control software. Think R.

GitHub: a website that lets you store your Git repositories online and makes it easy to collaborate with others. Think RStudio.

Why should I use Git and GitHub? 🤔

More reproducible, transparent research
Better version control
Easy collaboration with others

The basics

Four verbs you need to know to use Git for version control:

add
commit
push
pull

Using Git in RStudio

Three different options:

RStudio GUI
Shell/terminal
GitHub Desktop

Do you have Git installed?

In your terminal, run:

which git

If you get a path like the following, Git is installed. Yay!

/usr/bin/git

If you get something like this:

git: command not found

Then you need to install Git. Follow these instructions: https://happygitwithr.com/install-git.

Have you introduced yourself to Git?

library(usethis)

use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")

Check your credentials in the terminal:

git config --global --list

Have you set your Personal Access Token for HTTPS?

Run the following:

usethis::create_github_token()

This opens a browser tab. Follow the instructions there to generate a token, copy it, and then store it by running:

gitcreds::gitcreds_set()

# ? Enter password or token: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# -> Adding new credentials...
# -> Removing credentials from cache...
# -> Done.

Repositories

A repository is like a folder for your project, but better!
Organises your work
Displays useful information, including a general description, navigation, changes
A great tool for project-oriented workflows

Starting a new project: create a repository

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: existing R project

We already have R projects that we started yesterday.
We can sync the existing R project with our new repository.

Sync your online repository with RStudio: existing R project

The usethis R package is a brilliant helper package.

install.packages("usethis")

usethis::create_from_github(
  "https://github.com/YOU/YOUR_REPO.git",
  destdir = "~/path/to/where/you/want/the/local/repo/"
)

Shell/terminal: Workflow

git pull
git add .
git commit -m "A meaningful message"
git push

pull any changes made and stored in your GitHub repository before making your changes

add those changes to your staging area

commit your changes with a meaningful message

push those committed changes up to GitHub

Working with others

GitHub is like Google Docs for your code.

EXERCISE

Create a new GitHub repository for this camp.
Sync your existing R project to this new repository.
add your scripts from yesterday and today.
Write a helpful commit message for your future self.
push your work up to GitHub.
Add me as a collaborator: @hgoers.

Data wrangling

Source: R4DS

World Bank GDP data

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

Official source of global and historical GDP data
A very common control variable in international relations (IR) and comparative politics (CP) analysis
Very frustratingly messy!

Working with external data

Introducing here::here()

install.packages("here")

here::here() returns the path to the top-level folder of your project on whichever computer the code is running. File paths built from it work for everyone, on any computer!

here::here()

[1] "/Users/harrietgoers/Documents/intro_to_r_ps"

Reading in your CSV

I like to store raw data in a folder called data-raw within my project.
I store any clean data that is ready for analysis in a data folder within my project.
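
A minimal sketch of this folder layout, using base R only. The project root here is a temporary folder standing in for your real project (in a real project, here::here() would supply it), and the file names example_raw.csv and example_clean.csv are made up for illustration.

```r
# Stand-in for the project root; in a real project, here::here() plays this role.
project_root <- file.path(tempdir(), "my_project")

# One folder for untouched raw files, one for cleaned, analysis-ready files.
raw_dir <- file.path(project_root, "data-raw")
clean_dir <- file.path(project_root, "data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(clean_dir, recursive = TRUE, showWarnings = FALSE)

# Pretend this file was downloaded: it lives in data-raw and is never edited by hand.
raw_path <- file.path(raw_dir, "example_raw.csv")
writeLines(c("Country Name,1960", "Aruba,", "Angola,100"), raw_path)

# Read the raw file, clean it in code, and save the result to data.
raw <- read.csv(raw_path, check.names = FALSE)
clean <- data.frame(
  country_name = raw[["Country Name"]],
  gdp_1960 = raw[["1960"]]
)
clean_path <- file.path(clean_dir, "example_clean.csv")
write.csv(clean, clean_path, row.names = FALSE)

# Both folders now sit side by side in the project.
list.files(project_root, recursive = TRUE)
```

Keeping the raw file untouched means you can always rerun your cleaning script from the original download.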

EXERCISE

Head over to the World Bank data center and find global GDP (current US$) data.
Download it as a .csv.
Store it somewhere useful in your R project.

Reading in your CSV

library(tidyverse)

read_csv(here::here("content", "slides", "data-raw", "wb_gdp.csv"))

# A tibble: 268 × 3
   `Data Source`               `World Development Indicators` ...3              
   <chr>                       <chr>                          <chr>             
 1 Last Updated Date           2023-03-01                      <NA>             
 2 Country Name                Country Code                   "Indicator Name,I…
 3 Aruba                       ABW                            "GDP (current US$…
 4 Africa Eastern and Southern AFE                            "GDP (current US$…
 5 Afghanistan                 AFG                            "GDP (current US$…
 6 Africa Western and Central  AFW                            "GDP (current US$…
 7 Angola                      AGO                            "GDP (current US$…
 8 Albania                     ALB                            "GDP (current US$…
 9 Andorra                     AND                            "GDP (current US$…
10 Arab World                  ARB                            "GDP (current US$…
# ℹ 258 more rows

EXERCISE

Read the ?read_csv help file. What arguments does this function take?
Head to the readr package documentation and find what other file types you can read in.

Skipping non-relevant rows

gdp_raw <- read_csv(
  here::here("content", "slides", "data-raw", "wb_gdp.csv"), 
  skip = 4, 
  col_select = 1:66
)

gdp_raw

# A tibble: 266 × 66
   `Country Name`      `Country Code` `Indicator Name` `Indicator Code`   `1960`
   <chr>               <chr>          <chr>            <chr>               <dbl>
 1 Aruba               ABW            GDP (current US… NY.GDP.MKTP.CD   NA      
 2 Africa Eastern and… AFE            GDP (current US… NY.GDP.MKTP.CD    2.13e10
 3 Afghanistan         AFG            GDP (current US… NY.GDP.MKTP.CD    5.38e 8
 4 Africa Western and… AFW            GDP (current US… NY.GDP.MKTP.CD    1.04e10
 5 Angola              AGO            GDP (current US… NY.GDP.MKTP.CD   NA      
 6 Albania             ALB            GDP (current US… NY.GDP.MKTP.CD   NA      
 7 Andorra             AND            GDP (current US… NY.GDP.MKTP.CD   NA      
 8 Arab World          ARB            GDP (current US… NY.GDP.MKTP.CD   NA      
 9 United Arab Emirat… ARE            GDP (current US… NY.GDP.MKTP.CD   NA      
10 Argentina           ARG            GDP (current US… NY.GDP.MKTP.CD   NA      
# ℹ 256 more rows
# ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
#   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, …
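
To see what skip does in isolation, here is a small sketch on a made-up CSV with two metadata lines above the real header. The text and values are invented for illustration, and passing literal text with I() assumes readr 2.0 or later.

```r
library(readr)

# A made-up CSV: two metadata lines sit above the real header row.
toy_csv <- I(
  "Data Source,Made-up example\nLast Updated Date,2023-03-01\ncountry,gdp\nAruba,1\nAngola,2\n"
)

# skip = 2 tells read_csv() to ignore the first two lines,
# so the third line ("country,gdp") is used as the header.
toy <- read_csv(toy_csv, skip = 2, show_col_types = FALSE)

names(toy)
nrow(toy)
```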

EXERCISE

Take a look at this data set:

skimr::skim(gdp_raw)

What types of data do we have? Are they the right type of data?

Are we missing data points?

Tidy Data Structures

Source: R4DS

Tidying World Bank data

What do you want to do with your data?

I want to analyse country, regional, and global trends in GDP over time

What I need:

Annual data on each country’s GDP
The region to which each country belongs

Tidying World Bank data

To do:

Move the yearly data from columns to rows
Clean up these column names so that they are easier to use in R
Add regional data

Pivoting your data

gdp_df <- pivot_longer(
  data = gdp_raw, 
  cols = `1960`:`2021`,
  names_to = "year",
  values_to = "gdp"
)

colnames(gdp_raw)

 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"

colnames(gdp_df)

[1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
[5] "year"           "gdp"
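
The same pivot on a toy data set may make the reshaping easier to follow. This is a sketch with invented countries and values, not the World Bank data.

```r
library(tidyr)

# Wide format: one row per country, one column per year.
wide <- tibble::tibble(
  country = c("A", "B"),
  `1960` = c(1, 2),
  `1961` = c(3, 4)
)

# Long format: one row per country-year.
# The old column names move into "year" and the cell values into "gdp".
long <- pivot_longer(
  data = wide,
  cols = `1960`:`1961`,
  names_to = "year",
  values_to = "gdp"
)

long
```

Two countries and two year columns become four rows. Note that year is still stored as text, because it came from column names.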

Clean column names

Column names should not:

Have spaces
Start with numbers
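
A short sketch of why these names are awkward, using an invented two-row data frame. The replacement names are chosen by hand here purely for illustration.

```r
# check.names = FALSE keeps the awkward names exactly as written.
messy <- data.frame(
  `Country Name` = c("Aruba", "Angola"),
  `1960` = c(NA, 100),
  check.names = FALSE
)

# Names with spaces or leading digits only work when wrapped in backticks.
messy$`Country Name`
messy$`1960`

# With clean names, no backticks are needed.
tidy <- messy
names(tidy) <- c("country_name", "gdp_1960")
tidy$country_name
```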

Introducing janitor:

# Install the `janitor` package

install.packages("janitor")

gdp_df <- janitor::clean_names(gdp_df)

colnames(gdp_df)

[1] "country_name"   "country_code"   "indicator_name" "indicator_code"
[5] "year"           "gdp"

Add region data

Introducing the countrycode package - the indispensable workhorse of country data:

# Install the `countrycode` package

install.packages("countrycode")

Add region data

library(countrycode)

# Add each country's World Bank region to the data set

gdp_df <- gdp_df |> 
  mutate(
    region = countrycode(country_name, 
                         "country.name", 
                         "region", 
                         custom_match = c("Turkiye" = "Europe & Central Asia"))
  ) |> 
  # Drop rows with no matched region, which removes the regional aggregates
  drop_na(region) |> 
  relocate(region, .after = "country_code")
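
To see why drop_na(region) removes the aggregates, here is a small sketch that runs countrycode() on one country and one regional aggregate. It assumes, as the output on these slides suggests, that aggregate names such as "Arab World" do not match any country.

```r
library(countrycode)

places <- c("Aruba", "Arab World")

# countrycode() returns NA (with a warning) for names it cannot match to a country.
# The warning is suppressed here only to keep the example output short.
regions <- suppressWarnings(
  countrycode(places, origin = "country.name", destination = "region")
)

regions
```

In your own work, read the warning rather than suppressing it: it lists every name that was not matched, which is how you spot cases that need a custom_match.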

Add region data

gdp_df |> 
  distinct(country_name, region) |> 
  head(10)

# A tibble: 10 × 2
   country_name         region                    
   <chr>                <chr>                     
 1 Aruba                Latin America & Caribbean 
 2 Afghanistan          South Asia                
 3 Angola               Sub-Saharan Africa        
 4 Albania              Europe & Central Asia     
 5 Andorra              Europe & Central Asia     
 6 United Arab Emirates Middle East & North Africa
 7 Argentina            Latin America & Caribbean 
 8 Armenia              Europe & Central Asia     
 9 American Samoa       East Asia & Pacific       
10 Antigua and Barbuda  Latin America & Caribbean

Make sure all data are the right type

glimpse(gdp_df)

Rows: 13,454
Columns: 7
$ country_name   <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code   <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region         <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year           <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Make sure all data are the right type

gdp_df <- transmute(
  gdp_df,
  country_name, 
  region, 
  year = as.integer(year),
  gdp
)

glimpse(gdp_df)

Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region       <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year         <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
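
A tiny sketch of why the type matters, using three made-up year values: years stored as text cannot be used in arithmetic until they are converted.

```r
year_chr <- c("1960", "1961", "1962")

# Arithmetic on text fails; try() captures the error so the script keeps running.
attempt <- try(year_chr + 1, silent = TRUE)
inherits(attempt, "try-error")

# After conversion, the years behave like numbers.
year_int <- as.integer(year_chr)
year_int + 1L
class(year_int)
```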

How do countries compare over time?

ggplot(gdp_df, aes(
  x = year, y = gdp, colour = region, group = country_name
)) + 
  geom_line() + 
  theme_minimal()

How do regions compare over time?

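By default, R propagates NAs: if any value in a group is missing, the mean for that group is also NA. This is good, because it stops missing data from going unnoticed!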

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp, na.rm = TRUE)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()
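
The difference between the two plots comes down to how mean() treats missing values. A minimal sketch with three made-up numbers:

```r
gdp <- c(10, NA, 30)

# Default behaviour: one missing value makes the whole result missing.
mean(gdp)
# [1] NA

# na.rm = TRUE drops the missing value first, then averages what is left.
mean(gdp, na.rm = TRUE)
# [1] 20
```

Dropping NAs is a choice, not a fix: the second mean describes only the observations you have, so check how many values are missing before relying on it.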

Summary

Today you:

Learnt how to read in external data
Learnt how to clean up common problems using R

HOMEWORK

Reminder

In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:

Is relevant to your research interests
Contains continuous and discrete variables

HOMEWORK

In case you are having some trouble:

Americanists can check out the American National Election Studies surveys.
Comparativists can check out Varieties of Democracy.
IR theorists can check out UCDP events data sets.