[1] 12
Data Transformation
Learn basic operations in R
Be introduced to dplyr
Clean up and transform your data
Create new objects with <-
Many functions come with R straight out of the box:
You can create objects using functions:
First, you need to install the gapminder
package:
Then access the gapminder
data set:
# A tibble: 6 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
In gapminder
:
fctr
stands for factors, which R uses to represent categorical variables with fixed possible values.
int
stands for integer.
dbl
stands for doubles (or real numbers).
Other types:
chr
stands for character vectors, or strings.
dttm
stands for date-times (a date + a time).
lgl
stands for logical, vectors that contain only TRUE
or FALSE
.1
dplyr
Help you with most of your data transformation needs.
Five basic functions:
filter()
arrange()
select()
mutate()
summarise()
filter()
dplyr
basicsFirst argument is always a data frame.
Subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).
Output is always a new data frame.
filter()
# A tibble: 420 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 410 more rows
filter()
# A tibble: 88 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Bahrain Asia 1992 72.6 529491 19036.
2 Bahrain Asia 1997 73.9 598561 20292.
3 Bahrain Asia 2002 74.8 656397 23404.
4 Bahrain Asia 2007 75.6 708573 29796.
5 Botswana Africa 1962 51.5 512764 984.
6 Botswana Africa 1967 53.3 553541 1215.
7 Botswana Africa 1972 56.0 619351 2264.
8 Botswana Africa 1977 59.3 781472 3215.
9 Botswana Africa 1982 61.5 970347 4551.
10 Comoros Africa 1997 60.7 527982 1174.
# ℹ 78 more rows
filter()
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
Find all country-years that have populations greater than 1 billion people.
Find all countries in Oceania.
Find all countries in both Asia and Europe.
Find all country-years that have a life expectancy greater than 50 years and less than 60 years.
Find all country-years that have a life expectancy less than 50 years or greater than 60 years.
# A tibble: 8 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 1982 65.5 1000281000 962.
2 China Asia 1987 67.3 1084035000 1379.
3 China Asia 1992 68.7 1164970000 1656.
4 China Asia 1997 70.4 1230075000 2289.
5 China Asia 2002 72.0 1280400000 3119.
6 China Asia 2007 73.0 1318683096 4959.
7 India Asia 2002 62.9 1034172547 1747.
8 India Asia 2007 64.7 1110396331 2452.
# A tibble: 24 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Australia Oceania 1952 69.1 8691212 10040.
2 Australia Oceania 1957 70.3 9712569 10950.
3 Australia Oceania 1962 70.9 10794968 12217.
4 Australia Oceania 1967 71.1 11872264 14526.
5 Australia Oceania 1972 71.9 13177000 16789.
6 Australia Oceania 1977 73.5 14074100 18334.
7 Australia Oceania 1982 74.7 15184200 19477.
8 Australia Oceania 1987 76.3 16257249 21889.
9 Australia Oceania 1992 77.6 17481977 23425.
10 Australia Oceania 1997 78.8 18565243 26998.
# ℹ 14 more rows
# A tibble: 756 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 746 more rows
# A tibble: 336 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Albania Europe 1952 55.2 1282697 1601.
2 Albania Europe 1957 59.3 1476505 1942.
3 Algeria Africa 1967 51.4 12760499 3247.
4 Algeria Africa 1972 54.5 14760787 4183.
5 Algeria Africa 1977 58.0 17152804 4910.
6 Bahrain Asia 1952 50.9 120447 9867.
7 Bahrain Asia 1957 53.8 138655 11636.
8 Bahrain Asia 1962 56.9 171863 12753.
9 Bahrain Asia 1967 59.9 202182 14805.
10 Bangladesh Asia 1982 50.0 93074406 677.
# ℹ 326 more rows
# A tibble: 1,368 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,358 more rows
==
is equal to
!=
is not equal to
>=
is greater than or equal to
<=
is less than or equal to
|
is OR
&
is AND
%in%
is in
arrange()
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779.
2 Afghanistan Asia 1957 30.3 9240934 821.
3 Afghanistan Asia 1962 32.0 10267083 853.
4 Afghanistan Asia 1967 34.0 11537966 836.
5 Afghanistan Asia 1972 36.1 13079460 740.
6 Afghanistan Asia 1977 38.4 14880372 786.
7 Afghanistan Asia 1982 39.9 12881816 978.
8 Afghanistan Asia 1987 40.8 13867957 852.
9 Afghanistan Asia 1992 41.7 16317921 649.
10 Afghanistan Asia 1997 41.8 22227415 635.
# ℹ 1,694 more rows
arrange()
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Afghanistan Asia 2007 43.8 31889923 975.
2 Afghanistan Asia 2002 42.1 25268405 727.
3 Afghanistan Asia 1997 41.8 22227415 635.
4 Afghanistan Asia 1992 41.7 16317921 649.
5 Afghanistan Asia 1987 40.8 13867957 852.
6 Afghanistan Asia 1982 39.9 12881816 978.
7 Afghanistan Asia 1977 38.4 14880372 786.
8 Afghanistan Asia 1972 36.1 13079460 740.
9 Afghanistan Asia 1967 34.0 11537966 836.
10 Afghanistan Asia 1962 32.0 10267083 853.
# ℹ 1,694 more rows
Which country-year has the lowest life expectancy?
Which country-year has the largest population?
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 Rwanda Africa 1992 23.6 7290203 737.
2 Afghanistan Asia 1952 28.8 8425333 779.
3 Gambia Africa 1952 30 284320 485.
4 Angola Africa 1952 30.0 4232095 3521.
5 Sierra Leone Africa 1952 30.3 2143249 880.
6 Afghanistan Asia 1957 30.3 9240934 821.
7 Cambodia Asia 1977 31.2 6978607 525.
8 Mozambique Africa 1952 31.3 6446316 469.
9 Sierra Leone Africa 1957 31.6 2295678 1004.
10 Burkina Faso Africa 1952 32.0 4469979 543.
# ℹ 1,694 more rows
# A tibble: 1,704 × 6
country continent year lifeExp pop gdpPercap
<fct> <fct> <int> <dbl> <int> <dbl>
1 China Asia 2007 73.0 1318683096 4959.
2 China Asia 2002 72.0 1280400000 3119.
3 China Asia 1997 70.4 1230075000 2289.
4 China Asia 1992 68.7 1164970000 1656.
5 India Asia 2007 64.7 1110396331 2452.
6 China Asia 1987 67.3 1084035000 1379.
7 India Asia 2002 62.9 1034172547 1747.
8 China Asia 1982 65.5 1000281000 962.
9 India Asia 1997 61.8 959000000 1459.
10 China Asia 1977 64.0 943455000 741.
# ℹ 1,694 more rows
select()
# A tibble: 1,704 × 3
country year pop
<fct> <int> <int>
1 Afghanistan 1952 8425333
2 Afghanistan 1957 9240934
3 Afghanistan 1962 10267083
4 Afghanistan 1967 11537966
5 Afghanistan 1972 13079460
6 Afghanistan 1977 14880372
7 Afghanistan 1982 12881816
8 Afghanistan 1987 13867957
9 Afghanistan 1992 16317921
10 Afghanistan 1997 22227415
# ℹ 1,694 more rows
select()
# A tibble: 1,704 × 5
country continent year lifeExp pop
<fct> <fct> <int> <dbl> <int>
1 Afghanistan Asia 1952 28.8 8425333
2 Afghanistan Asia 1957 30.3 9240934
3 Afghanistan Asia 1962 32.0 10267083
4 Afghanistan Asia 1967 34.0 11537966
5 Afghanistan Asia 1972 36.1 13079460
6 Afghanistan Asia 1977 38.4 14880372
7 Afghanistan Asia 1982 39.9 12881816
8 Afghanistan Asia 1987 40.8 13867957
9 Afghanistan Asia 1992 41.7 16317921
10 Afghanistan Asia 1997 41.8 22227415
# ℹ 1,694 more rows
select()
# A tibble: 1,704 × 4
country continent year gdpPercap
<fct> <fct> <int> <dbl>
1 Afghanistan Asia 1952 779.
2 Afghanistan Asia 1957 821.
3 Afghanistan Asia 1962 853.
4 Afghanistan Asia 1967 836.
5 Afghanistan Asia 1972 740.
6 Afghanistan Asia 1977 786.
7 Afghanistan Asia 1982 978.
8 Afghanistan Asia 1987 852.
9 Afghanistan Asia 1992 649.
10 Afghanistan Asia 1997 635.
# ℹ 1,694 more rows
Select only the country
, year
, and lifeExp
variables from gapminder
.
What does the any_of()
function do? Why might it be helpful in conjunction with this vector?
country
, year
, and lifeExp
variables from gapminder
.# A tibble: 1,704 × 3
country year lifeExp
<fct> <int> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
6 Afghanistan 1977 38.4
7 Afghanistan 1982 39.9
8 Afghanistan 1987 40.8
9 Afghanistan 1992 41.7
10 Afghanistan 1997 41.8
# ℹ 1,694 more rows
any_of()
function do? Why might it be helpful in conjunction with this vector?# A tibble: 1,704 × 3
country year lifeExp
<fct> <int> <dbl>
1 Afghanistan 1952 28.8
2 Afghanistan 1957 30.3
3 Afghanistan 1962 32.0
4 Afghanistan 1967 34.0
5 Afghanistan 1972 36.1
6 Afghanistan 1977 38.4
7 Afghanistan 1982 39.9
8 Afghanistan 1987 40.8
9 Afghanistan 1992 41.7
10 Afghanistan 1997 41.8
# ℹ 1,694 more rows
mutate()
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
# ℹ 1,694 more rows
transmute()
# A tibble: 1,704 × 3
country year gdp
<fct> <int> <dbl>
1 Afghanistan 1952 6567086330.
2 Afghanistan 1957 7585448670.
3 Afghanistan 1962 8758855797.
4 Afghanistan 1967 9648014150.
5 Afghanistan 1972 9678553274.
6 Afghanistan 1977 11697659231.
7 Afghanistan 1982 12598563401.
8 Afghanistan 1987 11820990309.
9 Afghanistan 1992 10595901589.
10 Afghanistan 1997 14121995875.
# ℹ 1,694 more rows
summarise()
group_by()
and summarise()
gapminder_continent <- group_by(gapminder, continent)
summarise(
gapminder_continent, avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)
)
# A tibble: 5 × 3
continent avg_pop avg_gdp_per_cap
<fct> <dbl> <dbl>
1 Africa 9916003. 2194.
2 Americas 24504795. 7136.
3 Asia 77038722. 7902.
4 Europe 17169765. 14469.
5 Oceania 8874672. 18622.
Calculate each country-year’s GDP.
Find each country’s average GDP across all years in the gapminder
data set.
Find the country with the smallest average GDP across these years.
Find the country with the largest average GDP across these years.
# A tibble: 1,704 × 7
country continent year lifeExp pop gdpPercap gdp
<fct> <fct> <int> <dbl> <int> <dbl> <dbl>
1 Afghanistan Asia 1952 28.8 8425333 779. 6567086330.
2 Afghanistan Asia 1957 30.3 9240934 821. 7585448670.
3 Afghanistan Asia 1962 32.0 10267083 853. 8758855797.
4 Afghanistan Asia 1967 34.0 11537966 836. 9648014150.
5 Afghanistan Asia 1972 36.1 13079460 740. 9678553274.
6 Afghanistan Asia 1977 38.4 14880372 786. 11697659231.
7 Afghanistan Asia 1982 39.9 12881816 978. 12598563401.
8 Afghanistan Asia 1987 40.8 13867957 852. 11820990309.
9 Afghanistan Asia 1992 41.7 16317921 649. 10595901589.
10 Afghanistan Asia 1997 41.8 22227415 635. 14121995875.
# ℹ 1,694 more rows
gapminder
data set.gapminder_country <- group_by(gapminder, country)
gapminder_country_avg <- summarise(
mutate(gapminder_country, gdp = pop * gdpPercap), avg_gdp = mean(gdp)
)
head(gapminder_country_avg)
# A tibble: 6 × 2
country avg_gdp
<fct> <dbl>
1 Afghanistan 12709647583.
2 Albania 9094669267.
3 Algeria 96735171261.
4 Angola 25532681843.
5 Argentina 266754123835.
6 Australia 320253755823.
# A tibble: 142 × 2
country avg_gdp
<fct> <dbl>
1 Sao Tome and Principe 151723722.
2 Comoros 450509962.
3 Gambia 551998132.
4 Guinea-Bissau 583673324.
5 Djibouti 647199614.
6 Liberia 1054141313.
7 Equatorial Guinea 1143738921.
8 Lesotho 1257957241.
9 Eritrea 1683400010.
10 Burundi 2253225196.
# ℹ 132 more rows
That got messy!
We had a lot of different objects representing intermediate steps in our calculations.
We never need those objects again. Can we avoid creating them?
Let’s introduce perhaps the defining feature of the tidyverse: the pipe.
Read the pipe as:
Take this |> (and then…)
do this |> (and then…)
do this
gapminder |>
group_by(continent) |>
summarise(avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)) |>
arrange(avg_gdp_per_cap)
# A tibble: 5 × 3
continent avg_pop avg_gdp_per_cap
<fct> <dbl> <dbl>
1 Africa 9916003. 2194.
2 Americas 24504795. 7136.
3 Asia 77038722. 7902.
4 Europe 17169765. 14469.
5 Oceania 8874672. 18622.
Base pipe:
|>
Can be used without loading any packages
Relatively new: introduced in 2021
Tidyverse pipe:
%>%
Must load dplyr
or magrittr
to use
Calculate the average GDP per capita for countries in the Americas for all recorded years after 2000. Display only the country and average GDP per capita variables.
Plot your results.
gapminder_americas_2000 <- gapminder |>
filter(continent == "Americas", year > 2000) |>
group_by(country) |>
summarise(avg_gdp_per_cap = mean(gdpPercap))
gapminder_americas_2000
# A tibble: 25 × 2
country avg_gdp_per_cap
<fct> <dbl>
1 Argentina 10789.
2 Bolivia 3618.
3 Brazil 8599.
4 Canada 34824.
5 Chile 11975.
6 Colombia 6381.
7 Costa Rica 8684.
8 Cuba 7644.
9 Dominican Republic 5295.
10 Ecuador 6323.
# ℹ 15 more rows
ggplot(
gapminder_americas_2000,
aes(x = avg_gdp_per_cap, y = reorder(country, avg_gdp_per_cap))
) +
geom_col() +
theme_minimal() +
labs(title = "Average GDP per capita for countries in the Americas in 2002 and 2007",
x = "Average GDP per capita (US$)",
y = NULL) +
scale_x_continuous(labels = scales::label_dollar())
This session you have:
Learnt R basic syntax
Learnt how to transform your data
Written concise code that is easy to follow
Reminder
In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:
Is relevant to your research interests,
Contains continuous and discrete variables.