GVPT Maths Camp

Data Wrangling and Github

Learning objectives

  1. Import your data

  2. Clean your data

  3. Explore relational data

  4. Start to manage your scripts with Github

A familiar problem

A solution

  • Track changes to your documents, code, or data over time

  • Work from one document

  • Have access to your work from anywhere

  • Create safe points in case something breaks or you want to experiment

Git and Github

Open source version control software.


Think R.

A website that allows you to store your Git repositories online and makes it easy to collaborate with others.


Think RStudio.

Why should I use Git and Github? 🤔

  • More reproducible, transparent research

  • Better version control

  • Easy collaboration with others

The basics

Four verbs you need to know to use Git for version control:

  1. add

  2. commit

  3. push

  4. pull

Using Git in RStudio

Three different options:

  1. RStudio GUI

  2. Shell/terminal

  3. Github desktop1

Do you have Git installed?

In your terminal, run:

which git

If you get the following, yay!

/usr/bin/git

If you get something like this:

git: command not found

We need to install Git. Follow these instructions: https://happygitwithr.com/install-git.

Have you introduced yourself to Git?

library(usethis)

use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")

Check your credentials in the terminal:

git config --global --list

Have you set your Personal Access Token for HTTPS?

Run the following:

usethis::create_github_token()

It will open up a browser tab. Follow the instructions provided.

gitcreds::gitcreds_set()

# ? Enter password or token: ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# -> Adding new credentials...
# -> Removing credentials from cache...
# -> Done.

Repositories

  • A repository is like a folder for your project, but better!

  • Organises your work

  • Displays useful information, including a general description, navigation, changes

  • A great tool for project-oriented workflows

Starting a new project: create a repository

Starting a new project: create a repository

Starting a new project: create a repository

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: from scratch

Sync your online repository with RStudio: existing R project

  • We already have R projects that we started yesterday.

  • We can sync the existing R project with our new repository.

Sync your online repository with RStudio: existing R project

The usethis R package is a brilliant helper package.

install.packages("usethis")


usethis::create_from_github(
  "https://github.com/YOU/YOUR_REPO.git",
  destdir = "~/path/to/where/you/want/the/local/repo/"
)

RStudio GUI: Workflow

pull any changes made and stored in your Github repository before making your changes

add those changes to your staging area

commit your changes with a meaningful message

push those committed changes up to Github

Shell/terminal: Workflow

pull any changes made and stored in your Github repository before making your changes

add those changes to your staging area

commit your changes with a meaningful message

push those committed changes up to Github

Working with others

Github is like Google Docs for your code.

EXERCISE

  1. Create a new Github repository for this camp.

  2. Sync your existing R project to this new repository.

  3. add your scripts from yesterday and today.

  4. Write a helpful commit message for your future self.

  5. push your work up to Github.

  6. Add me as a collaborator: @hgoers.

Data wrangling

Source: R4DS

World Bank GDP data

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

  • Official source of global and historical GDP data

  • A very common control variable for IR and CP analysis

  • Very frustratingly messy!

World Bank GDP data

Source: World Bank

Working with external data

Introducing here::here()

install.packages("here")


Points to where you are on your computer. Updates for everyone on any computer!

here::here()
[1] "/Users/harrietgoers/Documents/intro_to_r_ps"

Reading in your CSV

  • I like to store raw data in a folder called data-raw within my project.

  • I store any clean data that is ready for analysis in a data folder within my project.

EXERCISE

  1. Head over to the World Bank data center and find global GDP (current US$) data.

  2. Download it as a .csv.

  3. Store it somewhere useful in your R project.

Reading in your CSV

library(tidyverse)

read_csv(here::here("content", "slides", "data-raw", "wb_gdp.csv"))
# A tibble: 268 × 3
   `Data Source`               `World Development Indicators` ...3              
   <chr>                       <chr>                          <chr>             
 1 Last Updated Date           2023-03-01                      <NA>             
 2 Country Name                Country Code                   "Indicator Name,I…
 3 Aruba                       ABW                            "GDP (current US$…
 4 Africa Eastern and Southern AFE                            "GDP (current US$…
 5 Afghanistan                 AFG                            "GDP (current US$…
 6 Africa Western and Central  AFW                            "GDP (current US$…
 7 Angola                      AGO                            "GDP (current US$…
 8 Albania                     ALB                            "GDP (current US$…
 9 Andorra                     AND                            "GDP (current US$…
10 Arab World                  ARB                            "GDP (current US$…
# ℹ 258 more rows


EXERCISE

  1. Read the ?read_csv help file. What arguments does this function take?

  2. Head to the readr package documentation and find what other file types you can read in.

Skipping non-relevant rows

Source: World Bank

Skipping non-relevant rows

gdp_raw <- read_csv(
  here::here("content", "slides", "data-raw", "wb_gdp.csv"), 
  skip = 4, 
  col_select = 1:66
)

gdp_raw
# A tibble: 266 × 66
   `Country Name`      `Country Code` `Indicator Name` `Indicator Code`   `1960`
   <chr>               <chr>          <chr>            <chr>               <dbl>
 1 Aruba               ABW            GDP (current US… NY.GDP.MKTP.CD   NA      
 2 Africa Eastern and… AFE            GDP (current US… NY.GDP.MKTP.CD    2.13e10
 3 Afghanistan         AFG            GDP (current US… NY.GDP.MKTP.CD    5.38e 8
 4 Africa Western and… AFW            GDP (current US… NY.GDP.MKTP.CD    1.04e10
 5 Angola              AGO            GDP (current US… NY.GDP.MKTP.CD   NA      
 6 Albania             ALB            GDP (current US… NY.GDP.MKTP.CD   NA      
 7 Andorra             AND            GDP (current US… NY.GDP.MKTP.CD   NA      
 8 Arab World          ARB            GDP (current US… NY.GDP.MKTP.CD   NA      
 9 United Arab Emirat… ARE            GDP (current US… NY.GDP.MKTP.CD   NA      
10 Argentina           ARG            GDP (current US… NY.GDP.MKTP.CD   NA      
# ℹ 256 more rows
# ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
#   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, …

EXERCISE

Take a look at this data set:

skimr::skim(gdp_raw)


What types of data do we have? Are they the right type of data?


Are we missing data points?

Tidy Data Structures

Source: R4DS

Tidying World Bank data

What do you want to do with your data?

  • I want to analyse country, regional, and global trends in GDP over time

What I need:

  • Annual data on each country’s GDP

  • The region to which each country belongs

Tidying World Bank data

To do:

  • Move the yearly data from columns to rows

  • Clean up these column names so that they are easier to use in R

  • Add regional data

Pivoting your data

gdp_df <- pivot_longer(
  data = gdp_raw, 
  cols = `1960`:`2021`,
  names_to = "year",
  values_to = "gdp"
)


colnames(gdp_raw)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"          
colnames(gdp_df)
[1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
[5] "year"           "gdp"           

Clean column names

Column names should not:

  • Have spaces

  • Start with numbers

Introducing janitor:

# Install the `janitor` package

install.packages("janitor")


gdp_df <- janitor::clean_names(gdp_df)

colnames(gdp_df)
[1] "country_name"   "country_code"   "indicator_name" "indicator_code"
[5] "year"           "gdp"           

Add region data

Introducing the countrycode package - the indispensable workhorse of country data:

# Install the `countrycode` package

install.packages("countrycode")

Add region data

library(countrycode)

# Add each country's World Bank region to the data set

gdp_df <- gdp_df |> 
  mutate(
    region = countrycode(country_name, 
                         "country.name", 
                         "region", 
                         custom_match = c("Turkiye" = "Europe & Central Asia"))
  ) |> 
  # Remove observations that are regions
  drop_na(region) |> 
  relocate(region, .after = "country_code")

Add region data

gdp_df |> 
  distinct(country_name, region) |> 
  head(10)
# A tibble: 10 × 2
   country_name         region                    
   <chr>                <chr>                     
 1 Aruba                Latin America & Caribbean 
 2 Afghanistan          South Asia                
 3 Angola               Sub-Saharan Africa        
 4 Albania              Europe & Central Asia     
 5 Andorra              Europe & Central Asia     
 6 United Arab Emirates Middle East & North Africa
 7 Argentina            Latin America & Caribbean 
 8 Armenia              Europe & Central Asia     
 9 American Samoa       East Asia & Pacific       
10 Antigua and Barbuda  Latin America & Caribbean 

Make sure all data are the right type

glimpse(gdp_df)
Rows: 13,454
Columns: 7
$ country_name   <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code   <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region         <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year           <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Make sure all data are the right type

gdp_df <- transmute(
  gdp_df,
  country_name, 
  region, 
  year = as.integer(year),
  gdp
)

glimpse(gdp_df)
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region       <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year         <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

How do countries compare over time?

ggplot(gdp_df, aes(
  x = year, y = gdp, colour = region, group = country_name
)) + 
  geom_line() + 
  theme_minimal()

How do regions compare over time?

By default, R will carry forward NAs. This is good!

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp, na.rm = T)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

Summary

Today you:

  • Learnt how to read in external data

  • Learnt how to clean up common problems using R

HOMEWORK

Reminder

In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:

  1. Is relevant to your research interests,

  2. Contains continuous and discrete variables.

HOMEWORK

In case you are having some trouble: