Data Visualisation
Introduction to R
Create your first plot in R
Test your hypotheses using informative data visualizations
R code:
Functions:
Using the console, find the sum of 45, 978, and 121.
What is 67 divided by 6?
What is the square root of 894? HINT: use the sqrt() function.
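One possible way to work through these in the console is sketched below. The sum() function is not named in the exercise; it is shown here only as an alternative to chaining the + operator.

```r
# Add the three numbers with the + operator
total_plus <- 45 + 978 + 121
total_plus

# sum() adds all of its arguments and gives the same total
total_sum <- sum(45, 978, 121)
total_sum

# Division uses the / operator
quotient <- 67 / 6
quotient

# Square root, using the hinted sqrt() function
root <- sqrt(894)
root
```

Storing each result with <- is optional; typing the expression on its own prints the answer straight away.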
R has a large number of in-built functions that allow you to manipulate your data.
For example, here’s a function that provides all numbers in a specified sequence:
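The code for this slide is not reproduced here. A minimal sketch, assuming the function meant is base R's seq():

```r
# All whole numbers from 1 to 10
one_to_ten <- seq(from = 1, to = 10)
one_to_ten

# Every second number from 1 to 10, set with the by argument
odds <- seq(from = 1, to = 10, by = 2)
odds
```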
Arguments are positional:
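A small illustration, using seq() as an assumed example: when argument names are left out, R matches the values to the function's arguments in order. Named arguments can be supplied in any order.

```r
# Arguments named explicitly
named <- seq(from = 1, to = 10, by = 2)

# Same call with the names dropped: values are matched
# by position to from, to, and by
by_position <- seq(1, 10, 2)

# Named arguments can be written in any order
reordered <- seq(by = 2, to = 10, from = 1)

# Check that all three calls return the same sequence
isTRUE(all.equal(named, by_position))
isTRUE(all.equal(named, reordered))
```

Naming arguments takes more typing but makes the call easier to read later.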
Packages are collections of R functions and data.
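A typical pattern for working with a package, sketched here for the tidyverse. The install line is commented out because it only needs to run once per machine and requires an internet connection.

```r
# Install once per machine, then leave commented out:
# install.packages("tidyverse")

# Load at the start of every R session that uses it
library(tidyverse)
```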
tidyverse packages.
For your sanity’s sake, for your co-author’s sanity’s sake
Keeps everything:
Organised
Reproducible
Sustainable
Use getwd() to see where you are on your computer.
Source: R4DS
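A quick sketch of what getwd() returns and why it matters. The file name "data/survey.csv" is hypothetical and used only for illustration.

```r
# Print the current working directory
current_dir <- getwd()
current_dir

# Relative paths are resolved from that directory, so a file
# referred to as "data/survey.csv" would be looked for here:
file.path(current_dir, "data", "survey.csv")
```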
Use getwd() to see where you now are on your computer.
From R4DS - Data Visualization:
Do cars with big engines use more fuel than cars with small engines?
This session will borrow (read: steal) heavily from Hadley Wickham’s R for Data Science book.
The. Best. Resource.
Hadley Wickham is one of the lead authors of the tidyverse. He created ggplot2 as part of his PhD dissertation.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(method = "lm") +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) +
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )
manufacturer | model | displ | year | cyl |
---|---|---|---|---|
audi | a4 | 1.8 | 1999 | 4 |
audi | a4 | 1.8 | 1999 | 4 |
audi | a4 | 2.0 | 2008 | 4 |
audi | a4 | 2.0 | 2008 | 4 |
audi | a4 | 2.8 | 1999 | 6 |
audi | a4 | 2.8 | 1999 | 6 |
Learn more about this data set by typing ?mpg into your console.
mpg data set
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
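The call that produced this summary is not shown. Output in this shape comes from glimpse(), so a reasonable sketch, assuming the ggplot2 and dplyr packages are installed, is:

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides glimpse()

# One line per column: name, type, and the first few values
glimpse(mpg)

# Base R alternatives that report similar information
str(mpg)
dim(mpg)  # number of rows, then number of columns
```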
mpg data set
A couple of useful variables:
displ: engine displacement, in litres
hwy: highway miles per gallon
How many rows are in mpg? How many columns?
What does the drv variable describe?
An empty canvas!
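A short sketch of how these questions might be approached, assuming ggplot2 is loaded. Calling ggplot() with only a data set draws a blank panel, because no variables have been mapped and no geoms added yet.

```r
library(ggplot2)

# Size of the data set
nrow(mpg)
ncol(mpg)

# The help page describes each variable, including drv:
# ?mpg

# A plot with data but no layers: an empty canvas
empty_canvas <- ggplot(data = mpg)
empty_canvas
```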
Take a closer look at the hwy and cyl variables. Describe them to me.
Make a scatter plot of them.
Take a closer look at the class and drv variables. Describe them to me.
Make a scatter plot of them. Why doesn’t this work as a data visualization?
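One way to attempt both plots is sketched below, assuming ggplot2 is loaded. The choice of which variable goes on which axis is an assumption. The table() call at the end is an extra step, not part of the exercise, that shows how many cars sit behind each point in the second plot.

```r
library(ggplot2)

# hwy against cyl: a numeric outcome against a small set of cylinder counts
hwy_cyl <- ggplot(data = mpg, mapping = aes(x = cyl, y = hwy)) +
  geom_point()
hwy_cyl

# class against drv: both variables are categorical, so every car in
# the same class/drv combination is drawn on exactly the same spot
class_drv <- ggplot(data = mpg, mapping = aes(x = class, y = drv)) +
  geom_point()
class_drv

# Count the cars in each combination to see what the plot hides
table(mpg$class, mpg$drv)
```

Because the points overlap, the second plot shows which combinations exist but not how common each one is.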
You can use visual elements or aesthetics (aes) to communicate many dimensions in your data.
Let’s look at a categorical variable: the class of car (SUV, 2 seater, pick up truck, etc.).
Look for meaningfully defined groups.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )
You can use visual elements to communicate your findings in engaging ways.
What’s gone wrong with this code? Why are the points not blue?
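The code this question refers to is not reproduced here. The sketch below assumes it is the matching exercise from R4DS, where the colour is placed inside aes(); a version with the colour set outside aes() is shown for comparison.

```r
library(ggplot2)

# Inside aes(), "blue" is treated as data: a one-level variable whose
# single value happens to be the text "blue". ggplot2 assigns it the
# first colour of its default palette and adds a legend.
mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))
mapped

# Outside aes(), colour is a fixed setting applied to every point
fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")
fixed
```

The general rule: put something inside aes() when it should vary with the data, and outside when it should be the same for every point.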
Name a categorical variable in mpg. Name a continuous one.
Map a continuous variable to color. How does this aesthetic behave differently for categorical vs. continuous variables?
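A sketch of the comparison, assuming ggplot2 is loaded and taking class as the categorical variable and cty (city miles per gallon) as the continuous one. Other choices from mpg would work equally well.

```r
library(ggplot2)

# Categorical variable: each class gets its own distinct colour
discrete_colour <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class))
discrete_colour

# Continuous variable: values are spread along a colour gradient
continuous_colour <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = cty))
continuous_colour
```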
Map class to the shape aesthetic. What does the warning tell you?
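A minimal sketch of this mapping, assuming ggplot2 is loaded. When the plot is drawn, ggplot2 warns that its default shape palette handles at most six discrete values, while class has seven, so one class is left off the plot.

```r
library(ggplot2)

# Map class to point shape instead of colour
shape_plot <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(shape = class))
shape_plot

# Number of distinct classes the shape palette is asked to cover
length(unique(mpg$class))
```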
Less is more when it comes to data visualization.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(colour = class)) +
  geom_smooth(method = "lm") +
  theme(
    legend.position = "bottom",
    panel.grid = element_blank(),
    panel.background = element_blank(),
    plot.title.position = "plot",
    plot.title = element_text(face = "bold")
  ) +
  labs(
    title = "Engine displacement and highway miles per gallon",
    subtitle = "Values for seven different classes of cars",
    x = "Engine displacement (L)",
    y = "Highway miles per gallon"
  )
Customize the last plot you made using the theme() function.
We often want to explore patterns in categorical (or discrete) data. We need new tools to do this.
Reorder in relation to frequency
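The code for this slide is not shown. One common way to order bars by frequency, sketched here as an assumption, is fct_infreq() from the forcats package (part of the tidyverse), applied to the class variable of mpg.

```r
library(ggplot2)
library(forcats)  # provides fct_infreq()

# Default bar chart: classes appear in alphabetical order
alphabetical <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()
alphabetical

# fct_infreq() reorders the categories from most to least frequent
by_frequency <- ggplot(data = mpg, mapping = aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(x = "class")
by_frequency
```

Ordering by frequency makes it easier to see at a glance which categories dominate the data.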
This session you:
Set up your data science tools
Plotted complex data in an engaging way
Discovered interesting relationships in the data
Connected these relationships or trends to your expectations (or hypotheses about the data)
In the final session, you will apply the skills you will learn over the next few days to a problem that interests you. To prepare for this, you need to find a data set that:
Is relevant to your research interests,
Contains continuous and discrete variables.