Appendix A — Basic ggplot2

A.1 Plotting in the tidyverse

ggplot2 forms a key part of the tidyverse – for many, the only part they use. It builds on the grammar of graphics proposed by the late Leland Wilkinson (Wilkinson 2013). In essence it provides simple rules for how graphics should be built, rules that can drive you mad until you get them.

The process for building a graph is something like the following.

  • Initiate a plot using ggplot.
  • Specify aesthetics which indicate what you want to plot from some data set.
  • Call a geom (or an alternative) to say how you want to plot it.
  • Add modifiers to change how it looks.

The order of operations is essentially always this, although exactly how the ordering is applied differs subtly, as we will show here.
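As a rough sketch of these steps, the snippet below builds the same scatter plot twice, once with the aesthetics set in ggplot and once with them set in the geom. It uses R's built-in mtcars data purely for illustration (an assumption, it is not the data used in this appendix), and the object names p_global and p_local are hypothetical.

```r
library(ggplot2)

# Illustrative only: mtcars is a built-in data set, not the one used later

# Aesthetics set when the plot is initiated: later layers inherit them
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # Initiate, set aesthetics
  geom_point() +                                    # Say how to plot it
  labs(x = "Weight", y = "Miles per gallon") +      # Modifiers
  theme_minimal()

# Aesthetics set inside the geom: they apply to that layer only
p_local <- ggplot(mtcars) +                         # Initiate
  geom_point(aes(x = wt, y = mpg)) +                # Aesthetics and geom together
  labs(x = "Weight", y = "Miles per gallon") +      # Modifiers
  theme_minimal()

# Compare the point positions each plot would draw
all.equal(layer_data(p_global)[, c("x", "y")],
          layer_data(p_local)[, c("x", "y")])
```

With a single geom the two forms should be interchangeable; the choice starts to matter once several layers share, or deliberately do not share, the same aesthetics.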

A.2 Example

To illustrate, we take the approval data set from the wooldridge package, Yong, Krosnick, and Wooldridge (2016), do a little wrangling and (eventually) produce some quite nice plots. Start by loading the libraries and retrieving the data.

library(tidyverse)
library(magrittr)    # Provides the %<>% assignment pipe used later
library(wooldridge)
data("approval")

The first few rows and columns of this look like:

   id month year   sp500   cpi cpifood approve
1 302     2 2001 1239.94 184.4   171.8   59.24
2 303     3 2001 1160.33 185.3   172.2   57.01
3 304     4 2001 1249.46 185.6   172.4   60.31
4 305     5 2001 1255.82 185.5   172.9   55.82
5 306     6 2001 1224.42 185.9   173.4   54.93
6 307     7 2001 1211.23 186.2   174.0   56.36

and all the available variables are

names(approval)
 [1] "id"         "month"      "year"       "sp500"      "cpi"       
 [6] "cpifood"    "approve"    "gasprice"   "unemploy"   "katrina"   
[11] "rgasprice"  "lrgasprice" "X11.Sep"    "iraqinvade" "lsp500"    
[16] "lcpifood"  

Typically we want to investigate trends and correlations, and graphing pairs (or more) of series is a good way to begin.

A.2.1 Scatter plot

A first scatter plot, using geom_point, of (log) real gas (petrol) prices against (log) food prices:

ggplot(approval, aes(x=lcpifood, y=lrgasprice)) +    # Initiate, set aesthetics
  geom_point()                                       # Display as points

OK, I guess, but a bit dull – so add some colour. This time aes is specified in the geom rather than in ggplot; either works, and each has advantages that we will see shortly.

ggplot(approval) +
  geom_point(aes(x=lcpifood, y=lrgasprice, color=month))  # Colours by month

Better, but how about…

ggplot(approval) +
  geom_point(aes(x=lcpifood, y=lrgasprice, color=approve), size=2, shape=17) + # Colours by popularity!
  scale_color_gradient(low="red", high="green") 

where the colours are a gradient we specify. For a continuous variable I get either shades of one colour or a gradient that we need to specify. But month can only be one of twelve categories, so a categorical variable (a factor) is needed to get genuinely different colours.

Let's do this – and add a different aesthetic, size, for year.

ggplot(approval) +
  geom_point(aes(x=lcpifood, y=lrgasprice, color=as.factor(month), size=as.factor(year)))
Warning: Using size for a discrete variable is not advised.

Note there is now a lot going on, maybe too much. ggplot2 thinks so!
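To see the difference between a continuous and a discrete colour scale in isolation, here is a minimal sketch on a made-up three-row data frame (toy and its columns are hypothetical, not part of approval). It uses layer_data, a ggplot2 helper that returns the values a layer will actually be drawn with.

```r
library(ggplot2)

# Hypothetical toy data: three points with a numeric month
toy <- data.frame(x = 1:3, y = c(2, 1, 3), month = c(1, 6, 12))

# Numeric month: a single continuous gradient from low to high
p_cont <- ggplot(toy) +
  geom_point(aes(x = x, y = y, colour = month)) +
  scale_colour_gradient(low = "red", high = "green")

# Factor month: one distinct colour per level and a discrete legend
p_disc <- ggplot(toy) +
  geom_point(aes(x = x, y = y, colour = as.factor(month)))

# Colours assigned to the three points in each plot
layer_data(p_cont)$colour
layer_data(p_disc)$colour
```

The first vector should run along the red-to-green gradient, while the second holds one unrelated colour per factor level.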

A.2.2 Time series plots

Our time index is a bit odd as the data set has year and month separately. Create a proper date series using:

approval %<>% 
  unite(date, year, month, sep="/") %>% 
  mutate(date = as.Date(paste0(date,"/01"), "%Y/%m/%d"))

I’ve used the %<>% assignment pipe (from magrittr) to pipe approval through the chain and assign the result back to approval, so this is now

   id       date   sp500   cpi cpifood approve gasprice unemploy katrina
1 302 2001-02-01 1239.94 184.4   171.8   59.24    148.4      4.6       0
2 303 2001-03-01 1160.33 185.3   172.2   57.01    144.7      4.5       0
3 304 2001-04-01 1249.46 185.6   172.4   60.31    156.4      4.2       0
4 305 2001-05-01 1255.82 185.5   172.9   55.82    172.9      4.1       0
5 306 2001-06-01 1224.42 185.9   173.4   54.93    164.0      4.7       0
6 307 2001-07-01 1211.23 186.2   174.0   56.36    148.2      4.7       0
  rgasprice lrgasprice X11.Sep iraqinvade   lsp500 lcpifood
1  80.47723   4.387974       0          0 7.122818 5.146331
2  78.08958   4.357857       0          0 7.056460 5.148656
3  84.26724   4.433993       0          0 7.130467 5.149817
4  93.20755   4.534829       0          0 7.135544 5.152713
5  88.21947   4.479828       0          0 7.110222 5.155601
6  79.59184   4.376912       0          0 7.099391 5.159055
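The %<>% operator comes from the magrittr package and is shorthand for piping an object through a chain and assigning the result back to the same name. A minimal sketch on a made-up two-row data frame (toy, a and b are hypothetical names) compares it with the explicit form:

```r
library(magrittr)  # %<>% and %>%
library(tidyr)     # unite()
library(dplyr)     # mutate()

# Hypothetical toy data with separate year and month columns
toy <- data.frame(year = c(2001L, 2001L), month = c(2L, 11L),
                  value = c(59.24, 57.01))

# Assignment pipe: a is overwritten with the result of the whole chain
a <- toy
a %<>%
  unite(date, year, month, sep = "/") %>%
  mutate(date = as.Date(paste0(date, "/01"), "%Y/%m/%d"))

# Explicit form: pipe, then assign with <-
b <- toy %>%
  unite(date, year, month, sep = "/") %>%
  mutate(date = as.Date(paste0(date, "/01"), "%Y/%m/%d"))

identical(a, b)   # Do the two spellings agree?
class(a$date)     # Is date now a proper Date column?
```

Which spelling to prefer is a matter of taste; the explicit form makes the overwrite more visible.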

Then I can plot a couple of series using two calls to geom_line

ggplot(approval) +
  geom_line(aes(x=date, y=unemploy), colour="red") +
  geom_line(aes(x=date, y=cpi), colour="blue") 

But this is pretty inefficient, as I would need a call to geom_line for every series I wanted to plot, and even then a single shared y-axis suits series of very different magnitudes badly. Plus the axis labels are not right.

This is where things really get interesting. I pivot_longer all the variables into a single column.

df <- pivot_longer(approval, cols=-c(date, id), names_to= "Var", values_to = "Val")
head(df)
# A tibble: 6 × 4
     id date       Var         Val
  <int> <date>     <chr>     <dbl>
1   302 2001-02-01 sp500    1240. 
2   302 2001-02-01 cpi       184. 
3   302 2001-02-01 cpifood   172. 
4   302 2001-02-01 approve    59.2
5   302 2001-02-01 gasprice  148. 
6   302 2001-02-01 unemploy    4.6
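To make the reshaping concrete, here is a small sketch on a two-row, two-variable data frame. The values are copied from the first two rows shown earlier; the names wide and long are hypothetical.

```r
library(tidyr)

# Hypothetical cut-down version of the data: two rows, two series
wide <- data.frame(
  id       = c(302L, 303L),
  date     = as.Date(c("2001-02-01", "2001-03-01")),
  cpi      = c(184.4, 185.3),
  unemploy = c(4.6, 4.5)
)

# Every column except date and id is stacked: names go to Var, values to Val
long <- pivot_longer(wide, cols = -c(date, id),
                     names_to = "Var", values_to = "Val")

dim(wide)   # 2 rows, one column per series
dim(long)   # one row per (date, series) pair
long
```

Each original row turns into one row per stacked variable, so the long table has as many rows as there are date and series combinations.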

Great! Now I can plot Val using one call to geom_line. This time, put the graph object into p and then explicitly plot it.

p  <- ggplot(df) +
  geom_line(aes(x=date, y=Val))
plot(p)

Oops! I need to tell ggplot2 to separate out the variables which are stored in Var. For this, use group:

p  <- ggplot(df) +
  geom_line(aes(x=date, y=Val, group=Var))
plot(p)

But this is better done using an aesthetic like colour, which implies group.

p  <- ggplot(df) +
  geom_line(aes(x=date, y=Val, colour=Var))
plot(p)
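What goes wrong without group is that geom_line joins every observation into a single path. The sketch below, on a made-up long data frame (long and the p_* names are hypothetical), uses layer_data to count how many separate lines each version would draw.

```r
library(ggplot2)

# Hypothetical long-format data: two series observed on two dates
long <- data.frame(
  date = rep(as.Date(c("2001-02-01", "2001-03-01")), times = 2),
  Var  = rep(c("cpi", "unemploy"), each = 2),
  Val  = c(184.4, 185.3, 4.6, 4.5)
)

p_none   <- ggplot(long) + geom_line(aes(x = date, y = Val))
p_group  <- ggplot(long) + geom_line(aes(x = date, y = Val, group = Var))
p_colour <- ggplot(long) + geom_line(aes(x = date, y = Val, colour = Var))

# Number of distinct groups, i.e. separate lines, in each plot
length(unique(layer_data(p_none)$group))
length(unique(layer_data(p_group)$group))
length(unique(layer_data(p_colour)$group))
```

The first count should be one and the other two should equal the number of series, which is why mapping colour to Var is enough on its own.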

OK, but can I plot them so we can see what’s going on, say in a grid? This is where faceting comes in.

p  <- p +
  facet_wrap(~Var, scales = "free")
plot(p)
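The scales argument of facet_wrap controls which axes may differ between panels: "free" releases both. Since every series here shares the same dates, one alternative worth trying is "free_y", which frees only the vertical axis. A sketch on a made-up long data frame (long and p_free_y are hypothetical names, and ncol = 1 is just one possible layout):

```r
library(ggplot2)

# Hypothetical long-format data: two series on very different scales
long <- data.frame(
  date = rep(as.Date(c("2001-02-01", "2001-03-01", "2001-04-01")), times = 2),
  Var  = rep(c("cpi", "unemploy"), each = 3),
  Val  = c(184.4, 185.3, 185.6, 4.6, 4.5, 4.2)
)

# One panel per series, stacked in a single column;
# the x axis is shared, the y axis is free in each panel
p_free_y <- ggplot(long, aes(x = date, y = Val, colour = Var)) +
  geom_line() +
  facet_wrap(~Var, scales = "free_y", ncol = 1)

# Number of panels the plot is split into
length(unique(layer_data(p_free_y)$PANEL))
```

Keeping the date axis common can make it easier to line up movements across panels, at the cost of a little flexibility.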

A bit more formatting…

p  <- p +
  theme_minimal() + 
  labs(title="Facet plots", x="", y="")
plot(p)

Finally, all in one go: drop the dummies and don’t store the plot as an object. There is also no legend, as the series are labelled in the facets. And I call a rather handy little function, geom_smooth, which fits (by default) a LOESS smoothing line.

approval %>% 
  select(-iraqinvade, -katrina, -X11.Sep) %>%
  pivot_longer(cols=-c(date, id), names_to="Var", values_to="Val") %>%
  ggplot(aes(x=date, y=Val, group=Var, colour=Var)) +
  geom_line() +
  geom_smooth() + # Smoother
  facet_wrap(~Var, scales = "free") +
  theme_minimal() + 
  theme(legend.position = "none") +
  labs(title="Facet plots", x="", y="")

Cool, huh?
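One footnote on geom_smooth: left to itself it picks a method automatically and prints a message saying which one it used. If you would rather state the choice, a hedged sketch on simulated data is below. The data frame toy, the span of 0.5 and the decision to hide the confidence band are all arbitrary choices made for illustration.

```r
library(ggplot2)

# Simulated noisy series, purely for illustration
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- sin(toy$x / 8) + rnorm(50, sd = 0.2)

p_smooth <- ggplot(toy, aes(x = x, y = y)) +
  geom_line() +
  geom_smooth(method = "loess",   # State the smoother explicitly
              formula = y ~ x,    # ... and the formula it fits
              span = 0.5,         # Smaller span gives a wigglier fit
              se = FALSE)         # Hide the confidence band

# Fitted values that the smoother layer (layer 2) will draw
head(layer_data(p_smooth, 2)[, c("x", "y")])
```

Stating method and formula also keeps the console quiet, which is handy when the plot is built inside a report.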