GVPT Maths Camp

Data Transformation

Learning objectives

  1. Learn basic operations in R

  2. Be introduced to dplyr

  3. Clean up and transform your data

R objects

Create new objects with <-

x <- 3 * 4

x
[1] 12


x <- 3 * 10

x
[1] 30

R functions

Many functions come with R straight out of the box:

seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10


You can create objects using functions:

x <- seq(1, 10)

x
 [1]  1  2  3  4  5  6  7  8  9 10

Introduction to data transformation

EXERCISE

First, you need to install the gapminder package:

install.packages("gapminder")

Then access the gapminder data set:

library(tidyverse)
library(gapminder)

head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.

Data types

In gapminder:

  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.

  • int stands for integer.

  • dbl stands for doubles (or real numbers).

Data types

Other types:

  • chr stands for character vectors, or strings.

  • dttm stands for date-times (a date + a time).

  • lgl stands for logical, vectors that contain only TRUE or FALSE.1

Introducing dplyr

Help you with most of your data transformation needs.

Five basic functions:

  • filter()

  • arrange()

  • select()

  • mutate()

  • summarise()

Filter rows with filter()

filter(gapminder, country == "Australia", year > 2000)
# A tibble: 2 × 6
  country   continent  year lifeExp      pop gdpPercap
  <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
1 Australia Oceania    2002    80.4 19546792    30688.
2 Australia Oceania    2007    81.2 20434176    34435.

dplyr basics

  • First argument is always a data frame.

  • Subsequent arguments typically describe which columns to operate on, using the variable names (without quotes).

  • Output is always a new data frame.

Filter rows with filter()

filter(gapminder, continent %in% c("Asia", "Oceania"))
# A tibble: 420 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 410 more rows

Filter rows with filter()

filter(gapminder, pop > 500000 & pop < 1000000)
# A tibble: 88 × 6
   country  continent  year lifeExp    pop gdpPercap
   <fct>    <fct>     <int>   <dbl>  <int>     <dbl>
 1 Bahrain  Asia       1992    72.6 529491    19036.
 2 Bahrain  Asia       1997    73.9 598561    20292.
 3 Bahrain  Asia       2002    74.8 656397    23404.
 4 Bahrain  Asia       2007    75.6 708573    29796.
 5 Botswana Africa     1962    51.5 512764      984.
 6 Botswana Africa     1967    53.3 553541     1215.
 7 Botswana Africa     1972    56.0 619351     2264.
 8 Botswana Africa     1977    59.3 781472     3215.
 9 Botswana Africa     1982    61.5 970347     4551.
10 Comoros  Africa     1997    60.7 527982     1174.
# ℹ 78 more rows

Filter rows with filter()

filter(gapminder, pop > 500000 | pop < 1000000)
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

EXERCISE

  1. Find all country-years that have populations greater than 1 billion people.

  2. Find all countries in Oceania.

  3. Find all countries in both Asia and Europe.

  4. Find all country-years that have a life expectancy greater than 50 years and less than 60 years.

  5. Find all country-years that have a life expectancy less than 50 years or greater than 60 years.

CHECK YOUR ANSWERS

  1. Find all country-years that have populations greater than 1 billion people.
filter(gapminder, pop > 1e9)
# A tibble: 8 × 6
  country continent  year lifeExp        pop gdpPercap
  <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
1 China   Asia       1982    65.5 1000281000      962.
2 China   Asia       1987    67.3 1084035000     1379.
3 China   Asia       1992    68.7 1164970000     1656.
4 China   Asia       1997    70.4 1230075000     2289.
5 China   Asia       2002    72.0 1280400000     3119.
6 China   Asia       2007    73.0 1318683096     4959.
7 India   Asia       2002    62.9 1034172547     1747.
8 India   Asia       2007    64.7 1110396331     2452.

CHECK YOUR ANSWERS

  1. Find all countries in Oceania.
filter(gapminder, continent == "Oceania")
# A tibble: 24 × 6
   country   continent  year lifeExp      pop gdpPercap
   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>
 1 Australia Oceania    1952    69.1  8691212    10040.
 2 Australia Oceania    1957    70.3  9712569    10950.
 3 Australia Oceania    1962    70.9 10794968    12217.
 4 Australia Oceania    1967    71.1 11872264    14526.
 5 Australia Oceania    1972    71.9 13177000    16789.
 6 Australia Oceania    1977    73.5 14074100    18334.
 7 Australia Oceania    1982    74.7 15184200    19477.
 8 Australia Oceania    1987    76.3 16257249    21889.
 9 Australia Oceania    1992    77.6 17481977    23425.
10 Australia Oceania    1997    78.8 18565243    26998.
# ℹ 14 more rows

CHECK YOUR ANSWERS

  1. Find all countries in both Asia and Europe.
filter(gapminder, continent %in% c("Asia", "Europe"))
# A tibble: 756 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 746 more rows

CHECK YOUR ANSWERS

  1. Find all country-years that have a life expectancy greater than 50 years and less than 60 years.
filter(gapminder, lifeExp > 50 & lifeExp < 60)
# A tibble: 336 × 6
   country    continent  year lifeExp      pop gdpPercap
   <fct>      <fct>     <int>   <dbl>    <int>     <dbl>
 1 Albania    Europe     1952    55.2  1282697     1601.
 2 Albania    Europe     1957    59.3  1476505     1942.
 3 Algeria    Africa     1967    51.4 12760499     3247.
 4 Algeria    Africa     1972    54.5 14760787     4183.
 5 Algeria    Africa     1977    58.0 17152804     4910.
 6 Bahrain    Asia       1952    50.9   120447     9867.
 7 Bahrain    Asia       1957    53.8   138655    11636.
 8 Bahrain    Asia       1962    56.9   171863    12753.
 9 Bahrain    Asia       1967    59.9   202182    14805.
10 Bangladesh Asia       1982    50.0 93074406      677.
# ℹ 326 more rows

CHECK YOUR ANSWERS

  1. Find all country-years that have a life expectancy less than 50 years or greater than 60 years.
filter(gapminder, lifeExp < 50 | lifeExp > 60)
# A tibble: 1,368 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,358 more rows

Handy operations

== is equal to


!= is not equal to


>= is greater than or equal to


<= is less than or equal to

Handy operations


| is OR


& is AND


%in% is in

Arrange rows with arrange()

arrange(gapminder, country, year)
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

Arrange rows with arrange()

arrange(gapminder, country, desc(year))
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       2007    43.8 31889923      975.
 2 Afghanistan Asia       2002    42.1 25268405      727.
 3 Afghanistan Asia       1997    41.8 22227415      635.
 4 Afghanistan Asia       1992    41.7 16317921      649.
 5 Afghanistan Asia       1987    40.8 13867957      852.
 6 Afghanistan Asia       1982    39.9 12881816      978.
 7 Afghanistan Asia       1977    38.4 14880372      786.
 8 Afghanistan Asia       1972    36.1 13079460      740.
 9 Afghanistan Asia       1967    34.0 11537966      836.
10 Afghanistan Asia       1962    32.0 10267083      853.
# ℹ 1,694 more rows

EXERCISE

  1. Which country-year has the lowest life expectancy?

  2. Which country-year has the largest population?

CHECK YOUR ANSWERS

  1. Which country-year has the lowest life expectancy?
arrange(gapminder, lifeExp)
# A tibble: 1,704 × 6
   country      continent  year lifeExp     pop gdpPercap
   <fct>        <fct>     <int>   <dbl>   <int>     <dbl>
 1 Rwanda       Africa     1992    23.6 7290203      737.
 2 Afghanistan  Asia       1952    28.8 8425333      779.
 3 Gambia       Africa     1952    30    284320      485.
 4 Angola       Africa     1952    30.0 4232095     3521.
 5 Sierra Leone Africa     1952    30.3 2143249      880.
 6 Afghanistan  Asia       1957    30.3 9240934      821.
 7 Cambodia     Asia       1977    31.2 6978607      525.
 8 Mozambique   Africa     1952    31.3 6446316      469.
 9 Sierra Leone Africa     1957    31.6 2295678     1004.
10 Burkina Faso Africa     1952    32.0 4469979      543.
# ℹ 1,694 more rows

CHECK YOUR ANSWERS

  1. Which country-year has the largest population?
arrange(gapminder, desc(pop))
# A tibble: 1,704 × 6
   country continent  year lifeExp        pop gdpPercap
   <fct>   <fct>     <int>   <dbl>      <int>     <dbl>
 1 China   Asia       2007    73.0 1318683096     4959.
 2 China   Asia       2002    72.0 1280400000     3119.
 3 China   Asia       1997    70.4 1230075000     2289.
 4 China   Asia       1992    68.7 1164970000     1656.
 5 India   Asia       2007    64.7 1110396331     2452.
 6 China   Asia       1987    67.3 1084035000     1379.
 7 India   Asia       2002    62.9 1034172547     1747.
 8 China   Asia       1982    65.5 1000281000      962.
 9 India   Asia       1997    61.8  959000000     1459.
10 China   Asia       1977    64.0  943455000      741.
# ℹ 1,694 more rows

Select columns with select()

select(gapminder, country, year, pop)
# A tibble: 1,704 × 3
   country      year      pop
   <fct>       <int>    <int>
 1 Afghanistan  1952  8425333
 2 Afghanistan  1957  9240934
 3 Afghanistan  1962 10267083
 4 Afghanistan  1967 11537966
 5 Afghanistan  1972 13079460
 6 Afghanistan  1977 14880372
 7 Afghanistan  1982 12881816
 8 Afghanistan  1987 13867957
 9 Afghanistan  1992 16317921
10 Afghanistan  1997 22227415
# ℹ 1,694 more rows

Select columns with select()

select(gapminder, country:pop)
# A tibble: 1,704 × 5
   country     continent  year lifeExp      pop
   <fct>       <fct>     <int>   <dbl>    <int>
 1 Afghanistan Asia       1952    28.8  8425333
 2 Afghanistan Asia       1957    30.3  9240934
 3 Afghanistan Asia       1962    32.0 10267083
 4 Afghanistan Asia       1967    34.0 11537966
 5 Afghanistan Asia       1972    36.1 13079460
 6 Afghanistan Asia       1977    38.4 14880372
 7 Afghanistan Asia       1982    39.9 12881816
 8 Afghanistan Asia       1987    40.8 13867957
 9 Afghanistan Asia       1992    41.7 16317921
10 Afghanistan Asia       1997    41.8 22227415
# ℹ 1,694 more rows

Select columns with select()

select(gapminder, -(lifeExp:pop))
# A tibble: 1,704 × 4
   country     continent  year gdpPercap
   <fct>       <fct>     <int>     <dbl>
 1 Afghanistan Asia       1952      779.
 2 Afghanistan Asia       1957      821.
 3 Afghanistan Asia       1962      853.
 4 Afghanistan Asia       1967      836.
 5 Afghanistan Asia       1972      740.
 6 Afghanistan Asia       1977      786.
 7 Afghanistan Asia       1982      978.
 8 Afghanistan Asia       1987      852.
 9 Afghanistan Asia       1992      649.
10 Afghanistan Asia       1997      635.
# ℹ 1,694 more rows

EXERCISE

  1. Select only the country, year, and lifeExp variables from gapminder.

  2. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c("country", "year", "lifeExp")
  1. What does the following code produce?
select(gapminder, starts_with("c"))

CHECK YOUR ANSWERS

  1. Select only the country, year, and lifeExp variables from gapminder.
select(gapminder, country, year, lifeExp)
# A tibble: 1,704 × 3
   country      year lifeExp
   <fct>       <int>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ 1,694 more rows

CHECK YOUR ANSWERS

  1. What does the any_of() function do? Why might it be helpful in conjunction with this vector?
vars <- c("country", "year", "lifeExp")

select(gapminder, any_of(vars))
# A tibble: 1,704 × 3
   country      year lifeExp
   <fct>       <int>   <dbl>
 1 Afghanistan  1952    28.8
 2 Afghanistan  1957    30.3
 3 Afghanistan  1962    32.0
 4 Afghanistan  1967    34.0
 5 Afghanistan  1972    36.1
 6 Afghanistan  1977    38.4
 7 Afghanistan  1982    39.9
 8 Afghanistan  1987    40.8
 9 Afghanistan  1992    41.7
10 Afghanistan  1997    41.8
# ℹ 1,694 more rows

CHECK YOUR ANSWERS

  1. What does the following code produce?
select(gapminder, starts_with("c"))
# A tibble: 1,704 × 2
   country     continent
   <fct>       <fct>    
 1 Afghanistan Asia     
 2 Afghanistan Asia     
 3 Afghanistan Asia     
 4 Afghanistan Asia     
 5 Afghanistan Asia     
 6 Afghanistan Asia     
 7 Afghanistan Asia     
 8 Afghanistan Asia     
 9 Afghanistan Asia     
10 Afghanistan Asia     
# ℹ 1,694 more rows

Add new variables with mutate()

mutate(gapminder, gdp = gdpPercap * pop)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          gdp
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows

Select, transform, and add new variables with transmute()

transmute(gapminder, country, year, gdp = gdpPercap * pop)
# A tibble: 1,704 × 3
   country      year          gdp
   <fct>       <int>        <dbl>
 1 Afghanistan  1952  6567086330.
 2 Afghanistan  1957  7585448670.
 3 Afghanistan  1962  8758855797.
 4 Afghanistan  1967  9648014150.
 5 Afghanistan  1972  9678553274.
 6 Afghanistan  1977 11697659231.
 7 Afghanistan  1982 12598563401.
 8 Afghanistan  1987 11820990309.
 9 Afghanistan  1992 10595901589.
10 Afghanistan  1997 14121995875.
# ℹ 1,694 more rows

Create summaries with summarise()

summarise(
  gapminder, avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)
)
# A tibble: 1 × 2
    avg_pop avg_gdp_per_cap
      <dbl>           <dbl>
1 29601212.           7215.

Creating grouped summaries with group_by() and summarise()

gapminder_continent <- group_by(gapminder, continent)

summarise(
  gapminder_continent, avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)
)
# A tibble: 5 × 3
  continent   avg_pop avg_gdp_per_cap
  <fct>         <dbl>           <dbl>
1 Africa     9916003.           2194.
2 Americas  24504795.           7136.
3 Asia      77038722.           7902.
4 Europe    17169765.          14469.
5 Oceania    8874672.          18622.

EXERCISE

  1. Calculate each country-year’s GDP.

  2. Find each country’s average GDP across all years in the gapminder data set.

  3. Find the country with the smallest average GDP across these years.

  4. Find the country with the largest average GDP across these years.

CHECK YOUR ANSWERS

  1. Calculate each country-year’s GDP.
mutate(gapminder, gdp = pop * gdpPercap)
# A tibble: 1,704 × 7
   country     continent  year lifeExp      pop gdpPercap          gdp
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
 2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
 3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
 4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
 5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
 6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
 7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
 8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
 9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
# ℹ 1,694 more rows

CHECK YOUR ANSWERS

  1. Find each country’s average GDP across all years in the gapminder data set.
gapminder_country <- group_by(gapminder, country)

gapminder_country_avg <- summarise(
  mutate(gapminder_country, gdp = pop * gdpPercap), avg_gdp = mean(gdp)
)

head(gapminder_country_avg)
# A tibble: 6 × 2
  country           avg_gdp
  <fct>               <dbl>
1 Afghanistan  12709647583.
2 Albania       9094669267.
3 Algeria      96735171261.
4 Angola       25532681843.
5 Argentina   266754123835.
6 Australia   320253755823.

CHECK YOUR ANSWERS

  1. Find the country with the smallest average GDP across these years.
arrange(gapminder_country_avg, avg_gdp)
# A tibble: 142 × 2
   country                   avg_gdp
   <fct>                       <dbl>
 1 Sao Tome and Principe  151723722.
 2 Comoros                450509962.
 3 Gambia                 551998132.
 4 Guinea-Bissau          583673324.
 5 Djibouti               647199614.
 6 Liberia               1054141313.
 7 Equatorial Guinea     1143738921.
 8 Lesotho               1257957241.
 9 Eritrea               1683400010.
10 Burundi               2253225196.
# ℹ 132 more rows

CHECK YOUR ANSWERS

  1. Find the country with the largest average GDP across these years.
arrange(gapminder_country_avg, desc(avg_gdp))
# A tibble: 142 × 2
   country        avg_gdp
   <fct>            <dbl>
 1 United States  6.40e12
 2 Japan          2.12e12
 3 China          1.70e12
 4 Germany        1.62e12
 5 United Kingdom 1.11e12
 6 France         1.04e12
 7 Italy          9.11e11
 8 India          8.62e11
 9 Brazil         8.11e11
10 Mexico         5.95e11
# ℹ 132 more rows

Combine multiple operations with the pipe

That got messy!

  • We had a lot of different objects representing intermediate steps in our calculations.

  • We never need those objects again. Can we avoid creating them?

Let’s introduce perhaps the defining feature of the tidyverse: the pipe.

Combine multiple operations with the pipe

Read the pipe as:

Take this |> (and then…)
    do this |> (and then…)
    do this

gapminder |> 
  group_by(continent) |> 
  summarise(avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)) |> 
  arrange(avg_gdp_per_cap)
# A tibble: 5 × 3
  continent   avg_pop avg_gdp_per_cap
  <fct>         <dbl>           <dbl>
1 Africa     9916003.           2194.
2 Americas  24504795.           7136.
3 Asia      77038722.           7902.
4 Europe    17169765.          14469.
5 Oceania    8874672.          18622.

Combine multiple operations with the pipe

gapminder |> 
  group_by(continent) |> 
  summarise(avg_pop = mean(pop), avg_gdp_per_cap = mean(gdpPercap)) |> 
  ggplot(aes(x = continent, y = avg_gdp_per_cap)) + 
  geom_col() + 
  theme_minimal()

A note on the pipe

Base pipe:

  • |>

  • Can be used without loading any packages

  • Relatively new: introduced in 2021

Tidyverse pipe:

  • %>%

  • Must load dplyr or magrittr to use

EXERCISE

  1. Calculate the average GDP per capita for countries in the Americas for all recorded years after 2000. Display only the country and average GDP per capita variables.

  2. Plot your results.

CHECK YOUR ANSWERS

  1. Calculate the average GDP per capita for countries in the Americas in all years including or after 2000.
gapminder_americas_2000 <- gapminder |> 
  filter(continent == "Americas", year > 2000) |> 
  group_by(country) |> 
  summarise(avg_gdp_per_cap = mean(gdpPercap))

gapminder_americas_2000
# A tibble: 25 × 2
   country            avg_gdp_per_cap
   <fct>                        <dbl>
 1 Argentina                   10789.
 2 Bolivia                      3618.
 3 Brazil                       8599.
 4 Canada                      34824.
 5 Chile                       11975.
 6 Colombia                     6381.
 7 Costa Rica                   8684.
 8 Cuba                         7644.
 9 Dominican Republic           5295.
10 Ecuador                      6323.
# ℹ 15 more rows

CHECK YOUR ANSWERS

  1. Plot your results.

CHECK YOUR ANSWERS

  1. Plot your results.
ggplot(
  gapminder_americas_2000, 
  aes(x = avg_gdp_per_cap, y = reorder(country, avg_gdp_per_cap))
) + 
  geom_col() +
  theme_minimal() + 
  labs(title = "Average GDP per capita for countries in the Americas in 2002 and 2007",
       x = "Average GDP per capita (US$)",
       y = NULL) + 
  scale_x_continuous(labels = scales::label_dollar())

Summary

This session you have:

  1. Learnt R basic syntax

  2. Learnt how to transform your data

  3. Written concise code that is easy to follow

HOMEWORK

Reminder

In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:

  1. Is relevant to your research interests,

  2. Contains continuous and discrete variables.