Ggplot2 notes

About these notes...
I'm writing these notes so I don't have to go back and re-learn how ggplot2 (and R) work(s) every time I want to use it/them. I'm putting this document online in the hope that someone else finds it helpful. This is a much more polished guide on how to use ggplot2: http://www.cookbook-r.com/Graphs/

The Grammar of Graphics

Ggplot2 is an implementation of the Grammar of Graphics, as Leland Wilkinson describes in his book, The Grammar of Graphics. I haven't actually read the book. I just know the title. From the title, I've drawn my own assumptions about the Grammar part of Grammar of Graphics and have decided that it might be helpful to think of ggplot charts as complete sentences[1]. They all have a subject, a verb, and an object. I'll go over each of these parts to show how to make a chart.

Subjects

Subjects are your data. Ggplot2 only speaks one language, however, so you'll likely have to convert your data into a format that ggplot2 can use before doing anything else. At a minimum, you'll likely need a column for the x-axis and another for the y-axis. If you want to make a chart of something with multiple series[2], you'll probably want to put all your measured variables in one column.

The reshape2 and plyr packages are probably most helpful for this purpose. The tapply function (in the base package) can be helpful, too, but I'd try reshape2, then plyr, then tapply.

Intro to our Example

For this example, I'll be using the mpg data that comes with the ggplot2 package to demonstrate the use of both reshape2 and plyr.

Suppose I want to make a chart with points and lines that shows that, in 1999 and in 2008, the Ford F-150 had better fuel economy for both city and highway than did the RAM 1500.

I'm starting with a data.frame that looks like this. The ellipsis (...) at the end just means I stopped copying there:

> library(ggplot2)
> mpg
    manufacturer                  model displ year cyl      trans drv cty hwy
1           audi                     a4   1.8 1999   4   auto(l5)   f  18  29
2           audi                     a4   1.8 1999   4 manual(m5)   f  21  29
3           audi                     a4   2.0 2008   4 manual(m6)   f  20  31
...

Taking Subsets of Data (Rows)

First, I need to whittle down the data, so I'm only looking at data for the F-150 and the RAM 1500. To do that, I just select by model name:

> just_trucks <- mpg[mpg$model == "f150 pickup 4wd" |
mpg$model == "ram 1500 pickup 4wd",]
> just_trucks
   manufacturer               model displ year cyl      trans drv cty hwy fl
65        dodge ram 1500 pickup 4wd   4.7 2008   8 manual(m6)   4  12  16  r
66        dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4   9  12  e
67        dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4  13  17  r
...

A couple of things to point out:

  1. The comma (second line) with nothing after it. Everything before the comma specifies the rows I want and everything after the comma specifies the columns. Because there's nothing after the comma, we get all the columns.
  2. mpg$model is the same thing as mpg[["model"]] or, less strictly, mpg[,"model"] (again, notice the comma). They all give us the column named "model."
  3. Using the equality operator (==) with mpg$model and "f150 pickup 4wd" gives us something that tells us which rows in the model column match f150 pickup 4wd and which ones don't.
  4. The vertical bar, or pipe (|), means "or."

To sum up, the first (and second) line tells R to give us all the rows and columns in mpg where the model column matches "f150 pickup 4wd" or "ram 1500 pickup 4wd".
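
When the list of models gets longer, chaining == with | gets unwieldy. Here's a minimal sketch of the same kind of row selection using %in% instead. The toy data.frame and its values are made up for illustration; they are not the real mpg data.

```r
# Hypothetical toy data standing in for mpg (values are made up)
toy <- data.frame(
  model = c("a4", "f150 pickup 4wd", "ram 1500 pickup 4wd", "f150 pickup 4wd"),
  year  = c(1999, 1999, 2008, 2008),
  cty   = c(18, 13, 12, 13),
  stringsAsFactors = FALSE
)

# The == / | form used above
with_or <- toy[toy$model == "f150 pickup 4wd" |
               toy$model == "ram 1500 pickup 4wd", ]

# The same selection with %in%, which is easier to extend to many values
wanted  <- c("f150 pickup 4wd", "ram 1500 pickup 4wd")
with_in <- toy[toy$model %in% wanted, ]

# This logical vector is the "something" that tells R which rows match
toy$model %in% wanted

# Both approaches pick the same rows
identical(with_or, with_in)
```

Either form works; %in% just keeps the line short once you're comparing against more than two or three values.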

Taking Subsets of Data (Columns)

Just to demonstrate, I'll also take a subset of the columns. I just want the model, year, cty, and hwy columns. Conveniently, there's a subset() function for doing just that:

> just_trucks <- subset(just_trucks, select = c(model, year, cty, hwy))
> just_trucks
                 model year cty hwy
65 ram 1500 pickup 4wd 2008  12  16
66 ram 1500 pickup 4wd 2008   9  12
67 ram 1500 pickup 4wd 2008  13  17
...
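
For comparison, here's a rough sketch of column selection with plain bracket indexing, plus subset() doing rows and columns in one call. The toy data.frame is hypothetical and its values are made up.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  model = c("f150 pickup 4wd", "ram 1500 pickup 4wd"),
  displ = c(4.6, 4.7),
  year  = c(1999, 2008),
  cty   = c(13, 12),
  hwy   = c(16, 16)
)

# Bracket indexing: nothing before the comma (all rows), names after it
keep    <- c("model", "year", "cty", "hwy")
by_name <- toy[, keep]

# subset() with select, as used above (column names are unquoted here)
by_subset <- subset(toy, select = c(model, year, cty, hwy))

# Both keep the same four columns
names(by_name)
names(by_subset)

# subset() can also filter rows and pick columns at the same time
only_2008 <- subset(toy, year == 2008, select = c(model, hwy))
only_2008
```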

'Melting' Data (i.e., Consolidating Columns)

Now that I have just the data I want, I need to format it so ggplot can draw separate lines for the city and highway fuel economy ratings for each truck.

Here's where the reshape2 package comes in handy:

> library(reshape2)
> jt_melted <- melt(just_trucks, measure.vars = c("cty", "hwy"))
> jt_melted
                 model year variable value
1  ram 1500 pickup 4wd 2008      cty    12
2  ram 1500 pickup 4wd 2008      cty     9
3  ram 1500 pickup 4wd 2008      cty    13
...

Things to point out:

  1. The first argument of the melt() function is the data.frame we want to melt. In this case, it's just_trucks.
  2. The measure.vars argument (used in measure.vars = c("cty", "hwy")) specifies that we want the cty and hwy columns combined.
  3. The melt() function took the cty and hwy columns and combined their values into one column called value. It also made another column, called variable, for indicating which value we're talking about, exactly (city or highway MPGs).

This combining makes it easier to draw different lines for city and highway MPGs on the graph/chart.

You can also specify which columns should not have their values consolidated by using id.vars, and if you wish to rename the value and variable columns, you can use the value.name and variable.name arguments, respectively.

Example:
melt(just_trucks, id.vars = c("model", "year"), measure.vars = c("cty", "hwy"),
value.name = "mpg_ratings", variable.name = "which_mpg")
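
If it helps to see what melting amounts to, here's a base-R sketch that builds the same long layout by hand. This is only an illustration of the idea, not how reshape2 implements melt(), and the toy data are made up.

```r
# Hypothetical wide data: one row per truck/year, one column per rating
toy <- data.frame(
  model = c("f150", "ram"),
  year  = c(1999, 2008),
  cty   = c(13, 12),
  hwy   = c(17, 16)
)

id_cols <- c("model", "year")

# Stack one block of rows per measured column, labelling each block
melted <- rbind(
  data.frame(toy[id_cols], variable = "cty", value = toy$cty),
  data.frame(toy[id_cols], variable = "hwy", value = toy$hwy)
)

# melt() stores the labels as a factor, so do the same here
melted$variable <- factor(melted$variable, levels = c("cty", "hwy"))

melted
```

The id columns get repeated once per measured column, which is why a melted data.frame has (rows x measured columns) rows.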

Verbs and Objects

In the context of ggplot, it's hard to demonstrate verbs without showing their objects (charts), so this next section includes a little bit of both.

Verbs are what you do to transform your data before using them to make a chart. In this case, I want to take average mpgs for each truck and year for both their city and highway fuel economy ratings. Hopefully, doing so will ensure that the F-150 appears to have better fuel economy ratings than the RAM 1500.

There are a couple of ways to take these averages:

  1. The plyr package includes the summarize() and ddply() functions that can help transform my data.
  2. The ggplot2 package provides some verbs (statistics functions) that can modify the data before they are displayed.

The first way makes for an easier transition into explaining how to make a chart (which is what these notes are for in the first place), so I'll show how to make a chart with the first method and come back to the second method at the end.

Taking means with the ddply() function in the plyr package

This first method is fairly straightforward. We load the plyr package and make a new data.frame with the averages we want:

> library(plyr)
> jt_avg <- ddply(jt_melted, c("model", "year", "variable"), summarize,
avg = mean(value))
> jt_avg
                model year variable      avg
1     f150 pickup 4wd 1999      cty 13.00000
2     f150 pickup 4wd 1999      hwy 16.20000
3     f150 pickup 4wd 2008      cty 13.00000
4     f150 pickup 4wd 2008      hwy 17.00000
5 ram 1500 pickup 4wd 1999      cty 11.00000
6 ram 1500 pickup 4wd 1999      hwy 15.33333
7 ram 1500 pickup 4wd 2008      cty 11.57143
8 ram 1500 pickup 4wd 2008      hwy 15.28571

Here, I've supplied the ddply() function with three arguments. It looks like four, but it's actually three:

  1. The first argument is the data.frame we care about.
  2. The second argument specifies the columns with which we want to group things.
  3. The third argument specifies the function we want to use for number-crunching. In this case, I want to take all the values for a given model, year, and fuel economy rating type, and summarize the MPGs as a mean, so the summarize() function is appropriate[3].
  4. Everything after summarize() is passed to summarize() as an argument. I only chose to calculate the mean, but I also could've asked it to calculate the median and/or the standard error, among many other calculations.
    Example:
    ddply(jt_melted, c("model", "year", "variable"), summarize, avg = mean(value),
    med = median(value), se = sqrt(var(value)/length(value)))
    1. The avg part of avg = mean(value) essentially names the column in which the calculated values are output (i.e., the means are output to a column named avg).
    2. mean() is a function in the base package.
    3. value is the name of the column for which we want means calculated.

Now we just have to graph the data and make things look nice-ish.

Graphing the data generated by ddply()

To demonstrate graphing and how ggplot2 works more generally, I'll start with a basic graph and build off of it. Creating a graph and adding things to it is kind of analogous to how ggplot2 works.

Basic Scatter Chart (admittedly without much to scatter)

> library(ggplot2)
> p <- ggplot(jt_avg, aes(x = year, y = avg))
> p + geom_point()

Output:

[Image: graph of just the points]

Three things to introduce here:

  1. The second line creates a ggplot object and stores it in a variable, p. Creating a ggplot object involves using the ggplot() function, which is generally used with two arguments: a data.frame and an aes.
  2. aes stands for aesthetic. Aesthetics are used to specify how data should be grouped, colored, drawn, ordered, and/or positioned.
    In this case, I've specified that the year column in our jt_avg data.frame should be positioned on the x-axis and that the avg column should be positioned on the y-axis.
  3. geom_point(), in the third line, is one of many geoms, or geometries. Geometries are what plot the data. geom_point() is helpful for making scatterplots. There are also geoms for making boxplots (geom_boxplot()), bar charts (geom_bar()), error bars (geom_errorbar()), histograms (geom_histogram()), smoothing curves (geom_smooth()), and lines (geom_line()), among others.
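
To make the "build it up in pieces" idea concrete, here's a small sketch that stores each stage in its own variable. The toy data.frame is a made-up stand-in for jt_avg, and the variable names are just for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for jt_avg (values are made up)
toy <- data.frame(
  model = rep(c("f150", "ram"), each = 2),
  year  = rep(c(1999, 2008), times = 2),
  avg   = c(13, 13.5, 11, 11.5)
)

# The base object: data plus the x/y aesthetics, but nothing drawn yet
p <- ggplot(toy, aes(x = year, y = avg))

# Each + adds a layer; the result is a new ggplot object you can store
with_points <- p + geom_point()
with_lines  <- with_points + geom_line(aes(group = model))

# Printing a ggplot object is what actually draws it
print(with_lines)
```

Note that p itself is never modified; p + geom_point() returns a new object, which is why the same p can be reused for different charts.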

Adding Colors and Shapes

Let's mess around with aesthetics some more and add some colors and shapes so we can differentiate between the two trucks.

> p + geom_point(aes(color = model, shape = model))

Output:

[Image: graph of just the points, now with different shapes and colors for the two truck models]
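
One thing that tripped me up: putting color inside aes() maps it to a column, while putting it outside aes() sets one fixed value for every point. A small sketch of the difference, using made-up data in place of jt_avg:

```r
library(ggplot2)

# Hypothetical stand-in for jt_avg (values are made up)
toy <- data.frame(
  model = rep(c("f150", "ram"), each = 2),
  year  = rep(c(1999, 2008), times = 2),
  avg   = c(13, 13.5, 11, 11.5)
)

p <- ggplot(toy, aes(x = year, y = avg))

# Mapped: color and shape vary with the model column, and a legend appears
mapped <- p + geom_point(aes(color = model, shape = model))

# Set: every point gets the same color, shape, and size; no legend
fixed <- p + geom_point(color = "red", shape = 17, size = 3)

print(mapped)
print(fixed)
```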

Using different types of plots

It's also possible to plot this as a bar chart instead of a scatter plot. Generally, geom_bar() uses stat_count(), which means it plots the number of data points in a category instead of plotting their values. In this instance, we want to plot the actual values, so we'll be using stat_identity():

> p + geom_bar(stat = "identity")

Output:

Honestly, that looks pretty terrible, so let's make it look a little worse. Like before, we'll split things up by color. Since we're comparing by models, we'll just mess with their fill:

> p + geom_bar(stat = "identity", aes(fill = model)) 

Output:

Now you'll notice that the bars are stacked, which is the default for geom_bar() — check the geom_bar() documentation, and you'll see that the default for position is stack. We don't want them stacked, so we'll use a different position. You can find all the different positions available to you by looking for functions in the ggplot2 package that start with position_ (e.g., position_dodge(), position_fill(), position_identity(), position_jitter(), position_jitterdodge(), position_nudge(), and position_stack()). See the documentation for ggplot2 in help.start() for more info on what each does. For this, we'll be using position_dodge() so that the different bars with the same X value are side by side.

> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") 

Output:
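
As a side note, ggplot2 (version 2.2.0 and later, as far as I know) also has geom_col(), which is geom_bar() with stat = "identity" already baked in. Here's a sketch showing the two spellings side by side, with made-up data standing in for jt_avg:

```r
library(ggplot2)

# Hypothetical stand-in for jt_avg: one value per model and year
toy <- data.frame(
  model = rep(c("f150", "ram"), each = 2),
  year  = factor(rep(c(1999, 2008), times = 2)),
  avg   = c(13, 13.5, 11, 11.5)
)

p <- ggplot(toy, aes(x = year, y = avg))

# The spelling used in these notes
bars_a <- p + geom_bar(stat = "identity", aes(fill = model),
                       position = "dodge")

# The shorter spelling; position_dodge() is the function behind "dodge"
bars_b <- p + geom_col(aes(fill = model), position = position_dodge())

print(bars_b)
```

Passing position_dodge() as a function call instead of the string "dodge" matters once you need to give it arguments, such as a dodge width.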

An aside

We have four bars now, but we had eight data points (our means). Where did the other four go? For each model and year, the city and highway bars are drawn in the same spot with the same fill, so the shorter one is hidden behind the taller one. If we remove one of the values, you'll see that one of the smaller values shows up. I'll modify the dataset to remove the F-150's 1999 average highway fuel economy:

> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity",
aes(fill = model), position = "dodge")

Output:

Note that, since I inserted the data into geom_bar and not into the base ggplot layer, represented by p, any other plots we plot will display the unmodified data. This lets you plot multiple data sets on one chart.

> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity",
aes(fill = model), position = "dodge") + geom_point()

Output:

I can also add various properties to the geom_point(), as before (e.g., color, shape) and use different positions on it. Here, I'll demonstrate the jitter position and the shapes we used before:

> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity",
aes(fill = model), position = "dodge") + geom_point(aes(shape = model),
position = "jitter")

Output:

Of course, we also could've modified p's dataset like so:

> p <- ggplot(data = jt_avg[jt_avg$avg != 16.2,], aes(x = year, y = avg))

Then, re-running the command above would show that the data point is also missing from the scatter plot:

Output:
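
Here's a sketch of the per-layer data idea on a toy data.frame (made-up values). Instead of comparing a decimal with !=, which depends on exact floating-point equality, this version filters on a categorical column; that's my own substitution, not what the commands above do.

```r
library(ggplot2)

# Hypothetical stand-in for jt_avg (values are made up)
toy <- data.frame(
  model = rep(c("f150", "ram"), each = 2),
  year  = factor(rep(c(1999, 2008), times = 2)),
  avg   = c(13, 13.5, 11, 11.5)
)

p <- ggplot(toy, aes(x = year, y = avg))

# The bar layer gets its own, filtered data; the point layer inherits p's
layered <- p +
  geom_col(data = toy[toy$model != "ram", ], aes(fill = model)) +
  geom_point()

# layer_data() shows what each layer ended up drawing
nrow(layer_data(layered, 1))  # rows behind the bars
nrow(layer_data(layered, 2))  # rows behind the points

print(layered)
```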

Back to your regularly scheduled program

We wanted to compare both highway and city fuel efficiency, so we either need to add another aesthetic to help differentiate between those categories, or we can use facets. I'll demonstrate adding another aes first:

> p + geom_bar(stat = "identity", aes(fill = model, color = variable), position = "dodge") 

Output:

Remember that variable was our column (a factor) with the two levels hwy and cty. The color aesthetic sets the outline color for bar charts, and fill sets the actual color of the bars.

It's also possible to use facets to show two different plots with one plot, grouped by a specified factor:

> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") +
facet_wrap(c("variable"))

Output:

You can specify two factors in facet_wrap instead of one if you wish to have a grid of plots. For example, for a 2x2 (number of variables by number of years) chart, we could separate the charts by year:

> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") +
facet_wrap(c("variable", "year"))

Output:

For the next example, we'll build on the chart before this one (the one faceted by variable only). First off, I don't like charts with colors for print media, so let's make the bars different shades of gray. For that, we'll use scale_fill_grey().

There are scale functions for basically every type of aesthetic (e.g., scales for alpha, color, fill, linetype, shape, size -- see help.start() documentation for all of them and what they do). scale also means something else (x and y axes), so don't confuse those functions with these. Since we're trying to change the colors for the fill, we'll use one of the scale_fill functions:

> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") +
facet_wrap(c("variable")) + scale_fill_grey()

Output:

If, for some reason, you don't like those color choices, you may also specify the colors manually. You must specify enough colors for each of your categories. We only have two categories of model (f150 and ram 1500), so I only have to specify two colors:

> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") +
facet_wrap(c("variable")) + scale_fill_manual(values = c("black", "white"))

Output:

Just remember that even when you have pretty charts, things aren't always going to be black and white. I don't particularly like that the facets are labeled "cty" and "hwy," so let's change those to "City" and "Highway". For this, we'll need to redo our base ggplot object so it reflects the changes:

> levels(jt_avg$variable) <- c("City", "Highway")
> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(values = c("black", "white"))
> p

Output:

We can also change the labels in the legend:

> levels(jt_avg$variable) <- c("City", "Highway")
> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white"))
> p

Output:

And we can change the chart and axis titles:

> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"),
values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!",
x = "Model Year (year)", y = "Fuel Economy (mpg)")
> p

Output:

It'd also be nice to display the years as if they were separate categories instead of numbers on a number line. For that, we need to set the years column as a factor and not a continuous variable:

> jt_avg$year <- as.factor(jt_avg$year)
> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"),
values = c("black","white")) + labs(title = "RAM 1500 Sucks!!!1!!1!",
x = "Model Year (year)", y = "Fuel Economy (mpg)")

Output:

I also want the y-axis to display just even numbers from 0 to 17. We can change the scale to display just that:

> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"),
values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!",
x = "Model Year (year)", y = "Fuel Economy (mpg)") +
scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2))

Output:

And finally, I'd like to make sure the white bars (showing the RAM 1500) are pretty much impossible to distinguish from a white background. There are several preset themes (e.g., theme_bw(), theme_classic(), theme_dark(), theme_gray(), theme_light(), theme_linedraw(), and theme_minimal()), so we'll pick one that has a white background, theme_minimal():

> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"),
values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!",
x = "Model Year (year)", y = "Fuel Economy (mpg)") +
scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2)) +
theme_minimal()

Output:

Actually, let's just remove the grid lines for the Y-axis so that the white completely blends in:

> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity",
aes(fill = model), position = "dodge") + facet_wrap(c("variable")) +
scale_fill_manual(labels = c("F-150", "RAM 1500"),
values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!",
x = "Model Year (year)", y = "Fuel Economy (mpg)") +
scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2)) +
theme_minimal() + theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank())

Output:

There are lots of other little theme tweaks one can do (e.g., panel.background to change the background color, axis.ticks.x to change how the tick lines are styled).
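
A quick sketch of a couple of those tweaks, assuming a made-up data.frame in place of jt_avg. The particular colors are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for jt_avg (values are made up)
toy <- data.frame(
  model = rep(c("f150", "ram"), each = 2),
  year  = factor(rep(c(1999, 2008), times = 2)),
  avg   = c(13, 13.5, 11, 11.5)
)

tweaked <- ggplot(toy, aes(x = year, y = avg)) +
  geom_point() +
  theme_minimal() +
  theme(
    # Light gray panel with no border
    panel.background = element_rect(fill = "grey95", color = NA),
    # Black tick marks on the x-axis only
    axis.ticks.x = element_line(color = "black"),
    # Drop the minor grid lines, keep the major ones
    panel.grid.minor = element_blank()
  )

print(tweaked)
```

The theme() call has to come after the preset theme (theme_minimal() here); otherwise the preset overwrites the tweaks.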

You may have noticed that this is a lot of code to put on one line. Because ggplot uses addition signs to add things, it's easy to store individual parts of the chart in different variables and then combine them together to break up the code, especially when you have lots of geoms involved (e.g., error bars, regression curves).

Full code (should be copy-pastable):

library(ggplot2)
library(plyr)
library(reshape2)
  
# Limits vehicles to just trucks
just_trucks <- mpg[mpg$model == "f150 pickup 4wd" | 
  mpg$model == "ram 1500 pickup 4wd",]
  
# Limits data set to just model, year, and fuel economy
just_trucks <- subset(just_trucks, select = c(model, year, cty, hwy))
  
# Aggregate the cty and hwy columns into one column
jt_melted <- melt(just_trucks, measure.vars = c("cty", "hwy"))
  
# Get the mean, median, and standard error for each model/year/fuel economy
# combination
jt_avg <- ddply(jt_melted, c("model", "year", "variable"), summarize, 
  avg = mean(value), med = median(value), se = sqrt(var(value)/length(value)))
 
# Change names of cty and hwy to City and Highway
levels(jt_avg$variable) <- c("City", "Highway")
 
# Convert year to a factor so it displays as discrete model years
jt_avg$year <- as.factor(jt_avg$year)
 
base <- ggplot(jt_avg, aes(x = year, y = avg))
bars <- geom_bar(
  aes(fill = model),
  stat = "identity",
  position = "dodge",
  color = "black") # makes the line around the bars black
 
errorbars <- geom_errorbar(
  aes(ymax = avg + se, ymin = avg - se, group = model),
  # 0.9 matches the default bar width so the error bars sit centered on the bars
  position = position_dodge(width = 0.9),
  width = 0.25)
 
with_labels <- base + bars + errorbars + facet_wrap(c("variable")) +
  scale_fill_manual(
    labels = c("F-150", "RAM 1500"),
    values = c("black", "white")) +
  labs(title = "RAM 1500 Sucks!!!1!!1!",
    x = "Model Year (year)",
    y = "Fuel Economy (mpg)")
 
with_even_numbers <- with_labels +
  scale_y_continuous(
    limits = c(0, 17), 
    breaks = seq(0, 17, 2))
 
a_line <- element_line(color = "black", linetype = 1, size = 0.5)
with_theme <- with_even_numbers + theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line.x = a_line,
    axis.line.y = a_line,
    axis.ticks.x = a_line,
    axis.ticks.y = a_line,
    axis.ticks.length = unit(5, "pt"),
    legend.position = c(0.15, 0.9),
    legend.direction = "vertical",
    legend.background = element_rect(color = "black"),
    strip.text = element_text(
      family = "Arial",
      face = "bold",
      color = "black",
      size = 14)) +
  guides(fill = guide_legend(title = "Model"))
 
# Plot it
with_theme

Output:
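
Earlier I mentioned a second way to take the averages: letting ggplot2's own statistics functions do it. Here's a rough sketch of that approach for the highway ratings only, using stat_summary() on the un-averaged mpg data. It assumes ggplot2 3.3.0 or later (where the argument is named fun), and it's a simplified chart, not a reproduction of the full one above.

```r
library(ggplot2)

# Raw, un-averaged rows for the two trucks
trucks <- mpg[mpg$model %in% c("f150 pickup 4wd", "ram 1500 pickup 4wd"), ]

by_stat <- ggplot(trucks, aes(x = factor(year), y = hwy, fill = model)) +
  # ggplot2 computes the mean of hwy per model and year, then draws bars
  stat_summary(fun = mean, geom = "bar",
               position = position_dodge(width = 0.9)) +
  # mean_se() (from ggplot2) returns the mean plus/minus one standard error
  stat_summary(fun.data = mean_se, geom = "errorbar",
               position = position_dodge(width = 0.9), width = 0.25)

print(by_stat)
```

The trade-off is that the averages only exist inside the plot; with ddply() you also get a data.frame you can inspect or reuse.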

Finding help within R

First things first: R is a Turing-complete programming language. It comes with the control structures found in most other programming languages: if, else, for, and while, amongst others. It also comes with a great help server that can be launched from the interactive console by typing help.start(). The documentation for the base and stats packages is handy as a basic overview of R's vocabulary as a programming language. The search engine can be helpful for finding specific functions. In the context of ggplot2, the documentation for the ggplot2 package is extremely helpful once you know what you're looking at.

The interactive console also provides excellent avenues for accessing help documents. The ?? command is useful for searching for functions and the ? command is useful for reading the documentation. For example, in the ggplot2 package, some functions require a unit object. If I had no idea what a unit object was, I could use ??unit to find all the documentation entries that match unit. The ?? command also, helpfully, accepts regular expressions. Then, scrolling down the list of entries, I might see grid::unit, described as a "Function to Create a Unit Object". Typing ?grid::unit would then tell me how to create a unit object.
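
The same lookups can be written as ordinary function calls, which is handy in scripts. A small sketch follows; restricting the search to the grid package is my own choice to keep the result list short.

```r
# ??unit is shorthand for help.search("unit"); here limited to one package
hits <- help.search("unit", package = "grid")

# In an interactive session, printing hits lists the matching help topics
# print(hits)

# ?grid::unit is shorthand for help("unit", package = "grid")
# help("unit", package = "grid")

# Once you know the function, creating a unit object is one line
library(grid)
five_points <- unit(5, "pt")
is.unit(five_points)
```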

Summary

Footnotes

1. Having skimmed the book's table of contents, this sentence analogy is not quite the right way to think about it, but it'll suffice for the purposes of these notes.

2. MS Excel terminology.

3. summarize() is also provided by the plyr package. You can supply any other function for this third argument, but summarize() can already do quite a bit with the help of some other functions.