GVPT Maths Camp

Data Visualisation

Learning objectives for today

  1. Introduction to R

  2. Create your first plot in R

  3. Test your hypotheses using informative data visualizations

R basics

R code:

1 + 2
[1] 3

Functions:

sum(1, 2)
[1] 3

EXERCISE

  1. Open up RStudio.
  2. Using the console, find the sum of 45, 978, and 121.
  3. What is 67 divided by 6?
  4. What is the square root of 894? HINT: use the sqrt() function.

CHECK YOUR ANSWERS

Using the console, find the sum of 45, 978, and 121.

sum(45, 978, 121)
[1] 1144

Or:

45 + 978 + 121
[1] 1144

What is 67 divided by 6?

67 / 6
[1] 11.16667

What is the square root of 894?

sqrt(894)
[1] 29.89983
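
Functions can also take extra arguments that change what they return. As a small sketch, round() (a base R function that is not used elsewhere in these slides) trims the answer to the division question to two decimal places. The object name rounded is just an illustrative choice.

```r
# Round the result of 67 / 6 to two decimal places
rounded <- round(67 / 6, digits = 2)
rounded
# [1] 11.17
```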

Functions

R packages

Packages are collections of R functions and data.

# Install the relevant package(s)
install.packages("tidyverse")

# Load the packages in current session
library(tidyverse)
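
You only need to install a package once per computer, but you need to load it in every new session. One possible way to avoid reinstalling each time is sketched below. It assumes you are happy to check for the package with base R's requireNamespace() first; this is a convenience, not a required step.

```r
# Install tidyverse only if it is not already available on this computer
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the packages in the current session (needed every session)
library(tidyverse)
```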

EXERCISE

  1. Open up RStudio.
  2. Using the console, install the tidyverse packages.
install.packages("tidyverse")
  3. Load these packages in your current session.
library(tidyverse)

RStudio Projects

For your sanity’s sake, for your co-author’s sanity’s sake

Keeps everything:

  • Organised

  • Reproducible

  • Sustainable

EXERCISE

  1. In the RStudio console, type in getwd() to see where you are on your computer.
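
The folder that getwd() returns is where R looks for files by default. A rough sketch of how that matters in practice is below. The data folder and the file name my_data.csv are hypothetical and only used for illustration.

```r
# Where is R currently working?
project_dir <- getwd()
project_dir

# Build a path to a (hypothetical) file inside that folder
data_path <- file.path(project_dir, "data", "my_data.csv")
data_path

# Check whether the file actually exists before trying to read it
file.exists(data_path)
```

Inside an RStudio project, the working directory is the project folder, so paths like this keep working when you share the project with a co-author.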

EXERCISE

  1. Create a new RStudio project for this Maths Camp:

Source: R4DS

Data visualisation

From R4DS - Data Visualization:

Do cars with big engines use more fuel than cars with small engines?

R4DS

This session will borrow (read: steal) heavily from Hadley Wickham’s R for Data Science book.

  • The. Best. Resource.

  • Hadley Wickham is one of the lead authors of the tidyverse. He developed ggplot2 during his PhD.

Source: R4DS

Skipping to the end

How did we do this?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()

Load relevant packages and data

# Load the relevant packages
library(tidyverse)

# Load the data
mpg
manufacturer  model  displ  year  cyl
audi          a4     1.8    1999  4
audi          a4     1.8    1999  4
audi          a4     2.0    2008  4
audi          a4     2.0    2008  4
audi          a4     2.8    1999  6
audi          a4     2.8    1999  6

EXERCISE


Learn more about this data set by typing ?mpg into your console.

The mpg data set

glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

The mpg data set

A couple of useful variables:

  • displ: engine displacement, in litres

  • hwy: highway miles per gallon
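
Before plotting, it can help to check the range of the variables you plan to use. A minimal sketch using summarise() from the tidyverse follows. The object name ranges and the column names inside it are illustrative choices.

```r
library(tidyverse)

# Smallest and largest values of the two variables we will plot
ranges <- summarise(
  mpg,
  min_displ = min(displ),
  max_displ = max(displ),
  min_hwy = min(hwy),
  max_hwy = max(hwy)
)
ranges
```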

EXERCISE

  1. How many rows are in mpg? How many columns?
nrow(mpg)
ncol(mpg)


  2. What does the drv variable describe?
?mpg

Set up your plot

An empty canvas!

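# ggplot2 is already loaded by library(tidyverse); loading it again is harmless
library(ggplot2)
# ggthemes provides scale_color_colorblind(); install it once with install.packages("ggthemes")
library(ggthemes)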

ggplot(data = mpg)

Map your aesthetics

ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

Add in your cars

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point()

EXERCISE

  1. Make a scatter plot of hwy vs cyl.


  2. What happens if you make a scatter plot of class vs drv? Why is the plot not useful?


  3. Why does the following give an error and how would you fix it?
ggplot(data = mpg) + 
  geom_point()

Let’s look at groups in the data

  • We are not restricted to looking at only two interesting elements of our data.

  • You can use visual elements or aesthetics (aes) to communicate many dimensions in your data.

  • Let’s look at a categorical variable: the class of car (SUV, 2 seater, pick up truck, etc.).

  • Look for meaningfully defined groups.
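
One way to get a feel for the groups before plotting them is to count how many cars fall into each class. This is a small sketch using count() from the tidyverse. The object name class_counts is illustrative.

```r
library(tidyverse)

# Number of cars in each class of car
class_counts <- count(mpg, class)
class_counts
```

Each row of the result is one class. These are the groups that get their own colour in the plots that follow.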

Let’s look at groups in the data

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = class)) + 
  geom_point()

Add more information

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = class)) + 
  geom_point() + 
  geom_smooth(method = "lm") + 
  scale_color_colorblind()

Look at the relationship across all cars

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  scale_color_colorblind()
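
Because colour is now mapped inside geom_point() only, geom_smooth(method = "lm") fits a single straight line through all of the cars. As a sketch of what that line represents, you can fit the same linear model yourself with base R's lm(). The object name fit is illustrative.

```r
library(tidyverse)

# Linear model of highway miles per gallon on engine displacement
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line
coef(fit)
```

A negative slope for displ matches what the plot suggests: cars with bigger engines tend to get fewer highway miles per gallon.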

Add useful titles and labels

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()

Flexible visualization

You can use visual elements to communicate your findings in engaging ways.

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class == "2seater"))

Changing the look of your plots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), colour = "red")
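
Arguments placed outside aes() set a fixed look for every point, and colour is not the only option. The sketch below also sets size and alpha (transparency). The particular values are arbitrary choices for illustration.

```r
library(ggplot2)

# Fixed colour, size, and transparency for every point
p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    colour = "red",
    size = 3,
    alpha = 0.5
  )
p
```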

EXERCISE

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

EXERCISE

  1. Name a categorical variable in mpg. Name a continuous one.


  2. Map a continuous variable to color. How does this aesthetic behave differently for categorical vs. continuous variables?


  3. Map class to the shape aesthetic. What does the warning tell you?

Let’s clean our graph up

Less is more when it comes to data visualization.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme_minimal() + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()

EXERCISE

Head over to the ggplot2 documentation and find your favorite preset theme.

Creating your own theme

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(colour = class)) + 
  geom_smooth(method = "lm") + 
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) + 
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  ) + 
  scale_color_colorblind()
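
If you want the same look on several plots, one option is to save the theme() settings as an object and add it to each plot. This is a sketch under that assumption. The name theme_camp is hypothetical, and the second plot (cty against hwy) is only there to show the theme being reused.

```r
library(ggplot2)

# Store the custom theme settings once
theme_camp <- theme(
  legend.position = "bottom",
  panel.grid = element_blank(),
  panel.background = element_blank(),
  plot.title.position = "plot",
  plot.title = element_text(face = "bold")
)

# Reuse the stored theme on a different plot
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  labs(title = "City and highway miles per gallon") +
  theme_camp
p
```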

EXERCISE

Customize the last plot you made using the theme() function.

Working with categorical data

We often want to explore patterns in categorical (or discrete) data. We need new tools to do this.


select(mpg, manufacturer, model, drv)
# A tibble: 234 × 3
   manufacturer model      drv  
   <chr>        <chr>      <chr>
 1 audi         a4         f    
 2 audi         a4         f    
 3 audi         a4         f    
 4 audi         a4         f    
 5 audi         a4         f    
 6 audi         a4         f    
 7 audi         a4         f    
 8 audi         a4 quattro 4    
 9 audi         a4 quattro 4    
10 audi         a4 quattro 4    
# ℹ 224 more rows

Visualizing distributions

ggplot(mpg, aes(x = drv)) + 
  geom_bar()
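
geom_bar() counts the number of rows in each category for you. To see the numbers behind the bars, a small sketch using count() is shown below. The object name drv_counts is illustrative.

```r
library(tidyverse)

# Number of cars for each type of drive train
drv_counts <- count(mpg, drv)
drv_counts
```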

Visualizing distributions

Reorder in relation to frequency

ggplot(mpg, aes(x = fct_infreq(drv))) +
  geom_bar()
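
fct_infreq() comes from the forcats package, which is loaded with the tidyverse. It turns drv into a factor whose levels are ordered from most to least common. As a quick sketch, you can inspect that order directly. The object name drv_ordered is illustrative.

```r
library(tidyverse)

# Reorder the drive train categories by how often they appear
drv_ordered <- fct_infreq(mpg$drv)

# The levels are listed from most frequent to least frequent
levels(drv_ordered)
```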

Visualizing numeric variables

ggplot(mpg, aes(x = hwy)) +
  geom_histogram()
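
By default, geom_histogram() uses 30 bins and prints a message suggesting you pick a better value. A minimal sketch of setting the bin width yourself is below. The choice of 2 miles per gallon per bin is an arbitrary one for illustration, so try a few values.

```r
library(ggplot2)

# Histogram of highway miles per gallon with bins that are 2 mpg wide
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)
p
```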

Visualizing numeric variables

ggplot(mpg, aes(x = hwy)) +
  geom_density()

Visualizing numeric variables

ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_density()

Visualizing numeric variables

ggplot(mpg, aes(x = hwy, colour = drv, fill = drv)) +
  geom_density(alpha = 0.5)

Summary

This session you:

  1. Set up your data science tools

  2. Plotted complex data in an engaging way

  3. Discovered interesting relationships in the data

  4. Connected these relationships or trends to your expectations (or hypotheses about the data)

HOMEWORK

In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:

  1. Is relevant to your research interests,

  2. Contains continuous and discrete variables.