Data Wrangling and GitHub
Import your data
Clean your data
Explore relational data
Start to manage your scripts with GitHub
Track changes to your documents, code, or data over time
Work from one document
Have access to your work from anywhere
Create safe points in case something breaks or you want to experiment
Git is open-source version control software.
Think R.
GitHub is a website that allows you to store your Git repositories online and makes it easy to collaborate with others.
Think RStudio.
More reproducible, transparent research
Better version control
Easy collaboration with others
Four verbs you need to know to use Git for version control:
add
commit
push
pull
Three different options:
RStudio GUI
Shell/terminal
GitHub Desktop
In your terminal, run:
which git
If you get a file path like the following, yay!
/usr/bin/git
If you get something like this:
git: command not found
We need to install Git. Follow these instructions: https://happygitwithr.com/install-git.
Check your credentials in the terminal:
git config --global --list
Run the following:
It will open up a browser tab. Follow the instructions provided.
A repository is like a folder for your project, but better!
Organises your work
Displays useful information, including a general description, navigation, changes
A great tool for project-oriented workflows
We already have R projects that we started yesterday.
We can sync the existing R project with our new repository.
The usethis R package is a brilliant helper package.
GitHub is like Google Docs for your code.
Create a new GitHub repository for this camp.
Sync your existing R project to this new repository.
add your scripts from yesterday and today.
Write a helpful commit message for your future self.
push your work up to GitHub.
Add me as a collaborator: @hgoers.
https://data.worldbank.org/indicator/NY.GDP.MKTP.CD
Official source of global and historical GDP data
A very common control variable for IR and CP analysis
Very frustratingly messy!
Introducing here::here()
Builds file paths starting from your project's root folder, so the same code works for everyone on any computer!
I like to store raw data in a folder called data-raw within my project.
I store any clean data that is ready for analysis in a data folder within my project.
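As a minimal sketch of this layout, the paths below are built with here::here(). The file name gdp.csv is a placeholder for illustration, and the sketch assumes both folders sit directly under the project root.

```r
library(here)

# Build paths from the project root instead of typing out a full,
# computer-specific path. "gdp.csv" is a hypothetical file name.
raw_path   <- here::here("data-raw", "wb_gdp.csv")
clean_path <- here::here("data", "gdp.csv")

raw_path
clean_path
```

The printed paths start wherever the project lives on the current computer, which is why nobody needs to edit them by hand.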
Head over to the World Bank data center and find global GDP (current US$) data.
Download it as a .csv.
Store it somewhere useful in your R project.
# A tibble: 268 × 3
`Data Source` `World Development Indicators` ...3
<chr> <chr> <chr>
1 Last Updated Date 2023-03-01 <NA>
2 Country Name Country Code "Indicator Name,I…
3 Aruba ABW "GDP (current US$…
4 Africa Eastern and Southern AFE "GDP (current US$…
5 Afghanistan AFG "GDP (current US$…
6 Africa Western and Central AFW "GDP (current US$…
7 Angola AGO "GDP (current US$…
8 Albania ALB "GDP (current US$…
9 Andorra AND "GDP (current US$…
10 Arab World ARB "GDP (current US$…
# ℹ 258 more rows
Read the ?read_csv help file. What arguments does this function take?
Head to the readr package documentation and find what other file types you can read in.
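library(tidyverse)

# Skip the four lines above the header row and keep columns 1 to 66 (through 2021)
gdp_raw <- read_csv(
  here::here("content", "slides", "data-raw", "wb_gdp.csv"),
  skip = 4,
  col_select = 1:66
)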
gdp_raw
# A tibble: 266 × 66
`Country Name` `Country Code` `Indicator Name` `Indicator Code` `1960`
<chr> <chr> <chr> <chr> <dbl>
1 Aruba ABW GDP (current US… NY.GDP.MKTP.CD NA
2 Africa Eastern and… AFE GDP (current US… NY.GDP.MKTP.CD 2.13e10
3 Afghanistan AFG GDP (current US… NY.GDP.MKTP.CD 5.38e 8
4 Africa Western and… AFW GDP (current US… NY.GDP.MKTP.CD 1.04e10
5 Angola AGO GDP (current US… NY.GDP.MKTP.CD NA
6 Albania ALB GDP (current US… NY.GDP.MKTP.CD NA
7 Andorra AND GDP (current US… NY.GDP.MKTP.CD NA
8 Arab World ARB GDP (current US… NY.GDP.MKTP.CD NA
9 United Arab Emirat… ARE GDP (current US… NY.GDP.MKTP.CD NA
10 Argentina ARG GDP (current US… NY.GDP.MKTP.CD NA
# ℹ 256 more rows
# ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
# `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
# `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
# `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
# `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
# `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, …
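To see what skip and col_select are doing, here is a small sketch on a made-up stand-in for the downloaded file. The text, the two-line preamble, and the GDP figures are all invented for illustration; the real file has more preamble lines and many more columns.

```r
library(readr)

# A made-up miniature of the World Bank file: two metadata lines, then
# the real header, then the data. Every line ends with a stray comma.
toy_csv <- I(paste(
  "Data Source,World Development Indicators,",
  "Last Updated Date,2023-03-01,",
  "Country Name,Country Code,1960,1961,",
  "Aruba,ABW,,,",
  "Afghanistan,AFG,538000000,549000000,",
  sep = "\n"
))

toy <- read_csv(
  toy_csv,
  skip = 2,            # jump over the two metadata lines
  col_select = 1:4,    # drop the empty column the trailing comma creates
  show_col_types = FALSE
)

toy
```

Without skip, the metadata lines would be treated as the header, which is what produced the unhelpful three-column tibble earlier.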
Take a look at this data set:
What types of data do we have? Are they the right type of data?
Are we missing data points?
What do you want to do with your data?
What I need:
Annual data on each country’s GDP
The region to which each country belongs
To do:
Move the yearly data from columns to rows
Clean up these column names so that they are easier to use in R
Add regional data
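The first item on the list, moving yearly data from columns to rows, can be sketched with tidyr::pivot_longer(). The tiny table and its GDP figures below are invented for illustration and are not the workshop data.

```r
library(tidyr)
library(tibble)

# A made-up wide table: one row per country, one column per year
wide <- tribble(
  ~country_name, ~`1960`,  ~`1961`,
  "Aruba",       NA_real_, NA_real_,
  "Afghanistan", 5.38e8,   5.49e8
)

# One row per country-year: the old column names become values of `year`
long <- pivot_longer(
  wide,
  cols = `1960`:`1961`,
  names_to = "year",
  values_to = "gdp"
)

long
```

Note that year comes out as a character column, because it was built from column names.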
[1] "Country Name" "Country Code" "Indicator Name" "Indicator Code"
[5] "1960" "1961" "1962" "1963"
[9] "1964" "1965" "1966" "1967"
[13] "1968" "1969" "1970" "1971"
[17] "1972" "1973" "1974" "1975"
[21] "1976" "1977" "1978" "1979"
[25] "1980" "1981" "1982" "1983"
[29] "1984" "1985" "1986" "1987"
[33] "1988" "1989" "1990" "1991"
[37] "1992" "1993" "1994" "1995"
[41] "1996" "1997" "1998" "1999"
[45] "2000" "2001" "2002" "2003"
[49] "2004" "2005" "2006" "2007"
[53] "2008" "2009" "2010" "2011"
[57] "2012" "2013" "2014" "2015"
[61] "2016" "2017" "2018" "2019"
[65] "2020" "2021"
Column names should not:
Have spaces
Start with numbers
Introducing janitor:
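A minimal sketch of what janitor::clean_names() does, using a one-row made-up table with the same kind of column names as the raw data:

```r
library(janitor)
library(tibble)

# Column names with spaces and a leading number, as in the raw file
messy <- tibble(
  `Country Name` = "Aruba",
  `Country Code` = "ABW",
  `1960` = NA_real_
)

tidy_names <- clean_names(messy)

names(tidy_names)
```

Spaces become underscores and everything is lower case. A name that starts with a number gets an x prefix, so it may be easier to move the years into rows before cleaning the names.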
Introducing the countrycode package - the indispensable workhorse of country data:
library(tidyverse)
library(countrycode)

# Add each country's World Bank region to the data set
gdp_df <- gdp_df |>
  mutate(
    region = countrycode(
      country_name,
      "country.name",
      "region",
      custom_match = c("Turkiye" = "Europe & Central Asia")
    )
  ) |>
  # Remove observations that are regional aggregates, not countries
  drop_na(region) |>
  relocate(region, .after = "country_code")
# A tibble: 10 × 2
country_name region
<chr> <chr>
1 Aruba Latin America & Caribbean
2 Afghanistan South Asia
3 Angola Sub-Saharan Africa
4 Albania Europe & Central Asia
5 Andorra Europe & Central Asia
6 United Arab Emirates Middle East & North Africa
7 Argentina Latin America & Caribbean
8 Armenia Europe & Central Asia
9 American Samoa East Asia & Pacific
10 Antigua and Barbuda Latin America & Caribbean
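Why does drop_na(region) remove the regional aggregates? A small sketch on three made-up rows suggests the answer: countrycode() returns NA for names it cannot match to a country. The argument warn = FALSE only silences the warning about unmatched names.

```r
library(countrycode)
library(dplyr)
library(tidyr)

# Two countries and one regional aggregate
toy <- tibble(country_name = c("Aruba", "Arab World", "Afghanistan"))

toy_regions <- toy |>
  mutate(
    region = countrycode(country_name, "country.name", "region", warn = FALSE)
  )

# "Arab World" is not a country, so its region is NA and the row is dropped
countries_only <- drop_na(toy_regions, region)

countries_only
```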
Rows: 13,454
Columns: 7
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
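The second summary differs from the first in two ways: year is an integer rather than text, and only four columns remain. One way to get there, sketched on two made-up rows and assuming dplyr is the tool used:

```r
library(dplyr)

# Two made-up rows with the same seven columns as the first summary
long <- tibble(
  country_name   = c("Aruba", "Aruba"),
  country_code   = c("ABW", "ABW"),
  region         = "Latin America & Caribbean",
  indicator_name = "GDP (current US$)",
  indicator_code = "NY.GDP.MKTP.CD",
  year           = c("1960", "1961"),
  gdp            = c(NA_real_, NA_real_)
)

slim <- long |>
  mutate(year = as.integer(year)) |>        # text to whole numbers
  select(country_name, region, year, gdp)   # keep only what we need

glimpse(slim)
```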
By default, R will carry forward NAs: a calculation that includes a missing value returns NA. This is good!
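A short sketch of this behaviour, using a made-up vector of three values:

```r
# One missing value among three GDP figures
gdp <- c(2.13e10, NA, 5.38e8)

# The NA is carried forward, so the result is NA
total_with_na <- sum(gdp)
total_with_na

# You have to ask explicitly for missing values to be left out
total_without_na <- sum(gdp, na.rm = TRUE)
total_without_na
```

Presumably this is good because R never quietly ignores missing data: you have to decide what to do with it.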
Today you:
Learnt how to read in external data
Learnt how to clean up common problems using R
Reminder
In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:
Is relevant to your research interests,
Contains continuous and discrete variables.
In case you are having some trouble:
Americanists can check out the American National Election Studies surveys.
Comparativists can check out Varieties of Democracy.
IR theorists can check out UCDP events data sets.