About these notes...
I'm writing these notes so I don't have to go back and re-learn how ggplot2 (and R) work(s) every time I want to use it/them. I'm putting this document online in the hope that someone else finds it helpful. This is a much more polished guide on how to use ggplot2: http://www.cookbook-r.com/Graphs/
ggplot2 is an implementation of the Grammar of Graphics, which Leland Wilkinson describes in his book, The Grammar of Graphics. I haven't actually read the book. I just know the title. From the title, I've drawn my own assumptions about the Grammar part of Grammar of Graphics and have decided that it might be helpful to think of ggplot charts as complete sentences[1]. They all have a subject, a verb, and an object. I'll go over each of these parts to show how to make a chart.
Subjects are your data. ggplot2 only speaks one language, however, so you'll likely have to convert your data into a format that ggplot2 can use before doing anything else. At a minimum, you'll likely need a column for the x-axis and another for the y-axis. If you want to make a chart of something with multiple series[2], you'll probably want to put all your measured variables in one column.
The reshape2 and plyr packages are probably most helpful for this purpose. The tapply function (in the base package) can be helpful, too, but I'd try reshape2, then plyr, then tapply.
For this example, I'll be using the mpg data that comes with the ggplot2 package to demonstrate the use of both reshape2 and plyr.
Suppose I want to make a chart with points and lines that shows that, in 1999 and in 2008, the Ford F-150 had better fuel economy for both city and highway than did the RAM 1500.
I'm starting with a data.frame that looks like this. The ellipsis (...) at the end just means I stopped copying there:
> library(ggplot2)
> mpg
  manufacturer model displ year cyl      trans drv cty hwy
1         audi    a4   1.8 1999   4   auto(l5)   f  18  29
2         audi    a4   1.8 1999   4 manual(m5)   f  21  29
3         audi    a4   2.0 2008   4 manual(m6)   f  20  31
...
First, I need to whittle down the data, so I'm only looking at data for the F-150 and the RAM 1500. To do that, I just select by model name:
> just_trucks <- mpg[mpg$model == "f150 pickup 4wd" | mpg$model == "ram 1500 pickup 4wd",]
> just_trucks
   manufacturer               model displ year cyl      trans drv cty hwy fl
65        dodge ram 1500 pickup 4wd   4.7 2008   8 manual(m6)   4  12  16  r
66        dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4   9  12  e
67        dodge ram 1500 pickup 4wd   4.7 2008   8   auto(l5)   4  13  17  r
...
A couple of things to point out:
mpg$model is the same thing as mpg[["model"]] or, less strictly, mpg[,"model"] (notice the comma). They all give us the column named "model."
Using the equality operator (==) with mpg$model and "f150 pickup 4wd" gives us something that tells us which rows in the model column match f150 pickup 4wd and which ones don't.
The vertical bar (|) means "or."
To sum up, the first command tells R to give us all the rows and columns in mpg where the model column matches "f150 pickup 4wd" or "ram 1500 pickup 4wd".
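As a small sketch of the row selection described above, the snippet below uses a made-up three-row data.frame (toy_cars is a hypothetical stand-in for mpg, not something ggplot2 provides) to show the logical vectors that == and | produce. The %in% line is an alternative I'm adding for comparison; it isn't used elsewhere in these notes.

```r
# toy_cars is a hypothetical stand-in for mpg
toy_cars <- data.frame(
  model = c("f150 pickup 4wd", "a4", "ram 1500 pickup 4wd"),
  cty = c(13, 18, 11),
  stringsAsFactors = FALSE
)

# == compares every value in the column against one string
is_f150 <- toy_cars$model == "f150 pickup 4wd"
is_ram <- toy_cars$model == "ram 1500 pickup 4wd"

# | combines the two logical vectors element by element ("or")
keep <- is_f150 | is_ram

# Rows go before the comma; leaving the part after it empty keeps all columns
picked <- toy_cars[keep, ]

# Alternative: %in% tests membership in a set of values
picked_in <- toy_cars[toy_cars$model %in% c("f150 pickup 4wd", "ram 1500 pickup 4wd"), ]
```

Both picked and picked_in should hold the same two rows. The %in% form tends to stay readable when the list of models grows.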
Just to demonstrate, I'll also take a subset of the columns. I just want the model, year, cty, and hwy columns. Conveniently, there's a subset() function for doing just that:
> just_trucks <- subset(just_trucks, select = c(model, year, cty, hwy))
> just_trucks
                 model year cty hwy
65 ram 1500 pickup 4wd 2008  12  16
66 ram 1500 pickup 4wd 2008   9  12
67 ram 1500 pickup 4wd 2008  13  17
...
Now that I have just the data I want, I need to format it so ggplot can draw separate lines for the city and highway fuel economy ratings for each truck.
Here's where the reshape2 package comes in handy:
> library(reshape2)
> jt_melted <- melt(just_trucks, measure.vars = c("cty", "hwy"))
> jt_melted
                model year variable value
1 ram 1500 pickup 4wd 2008      cty    12
2 ram 1500 pickup 4wd 2008      cty     9
3 ram 1500 pickup 4wd 2008      cty    13
...
Things to point out:
The first argument to the melt() function is the data.frame we want to melt. In this case, it's just_trucks.
The measure.vars argument (used in measure.vars = c("cty", "hwy")) specifies that we want the cty and hwy columns combined.
The melt() function took the cty and hwy columns and combined their values into one column called value. It also made another column, called variable, for indicating which value we're talking about, exactly (city or highway MPGs).
This combining makes it easier to draw different lines for city and highway MPGs on the graph/chart.
You can also specify which columns should not have their values consolidated by using id.vars, and if you wish to rename the value and variable columns, you can use the value.name and variable.name arguments, respectively.
melt(just_trucks, id.vars = c("model", "year"), measure.vars = c("cty", "hwy"), value.name = "mpg_ratings", variable.name = "which_mpg")
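To make the reshaping easier to see, here's a sketch of the same melt() call on a tiny made-up data.frame (toy_wide and its numbers are hypothetical, not taken from mpg). It assumes the reshape2 package is installed.

```r
library(reshape2)

# toy_wide is hypothetical: one row per truck, one column per rating
toy_wide <- data.frame(
  model = c("f150", "ram"),
  year = c(1999, 1999),
  cty = c(13, 11),
  hwy = c(16, 15)
)

toy_long <- melt(
  toy_wide,
  id.vars = c("model", "year"),        # columns left as they are
  measure.vars = c("cty", "hwy"),      # columns stacked into one
  variable.name = "which_mpg",         # replaces the default "variable"
  value.name = "mpg_ratings"           # replaces the default "value"
)

# Two wide rows become four long rows: the cty rows first, then the hwy rows
print(toy_long)
```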
In the context of ggplot, it's hard to demonstrate verbs without showing their objects (charts), so this next section includes a little bit of both.
Verbs are what you do to transform your data before using them to make a chart. In this case, I want to take average mpgs for each truck and year for both their city and highway fuel economy ratings. Hopefully, doing so will ensure that the F-150 appears to have better fuel economy ratings than the RAM 1500.
There are a couple of ways to take these averages:
The plyr package includes the summarize() and ddply() functions that can help transform my data.
The ggplot2 package provides some verbs (statistics functions) that can modify the data before they are displayed.
The first way makes for an easier transition into explaining how to make a chart (which is what these notes are for in the first place), so I'll show how to make a chart with the first method and come back to the second method at the end.
The ddply() function in the plyr package
This first method is fairly straightforward. We load the plyr package and make a new data.frame with the averages we want:
> library(plyr)
> jt_avg <- ddply(jt_melted, c("model", "year", "variable"), summarize, avg = mean(value))
> jt_avg
                model year variable      avg
1     f150 pickup 4wd 1999      cty 13.00000
2     f150 pickup 4wd 1999      hwy 16.20000
3     f150 pickup 4wd 2008      cty 13.00000
4     f150 pickup 4wd 2008      hwy 17.00000
5 ram 1500 pickup 4wd 1999      cty 11.00000
6 ram 1500 pickup 4wd 1999      hwy 15.33333
7 ram 1500 pickup 4wd 2008      cty 11.57143
8 ram 1500 pickup 4wd 2008      hwy 15.28571
Here, I've supplied the ddply() function with three arguments. It looks like four, but it's actually three:
The first is the data.frame we care about.
The second is the set of columns to group by: c("model", "year", "variable").
The third is the function to apply to each group. Here, the summarize() function is appropriate[3].
Anything after summarize() is passed to summarize() as an argument. I only chose to calculate the mean, but I also could've asked it to calculate the median and/or the standard error, among many other calculations.
ddply(jt_melted, c("model", "year", "variable"), summarize, avg = mean(value), med = median(value), se = sqrt(var(value)/length(value)))
The avg part of avg = mean(value) essentially names the column in which the calculated values are output (i.e., the means are output to a column named avg).
mean() is a function in the base package.
value is the name of the column for which we want means calculated.
Now we just have to graph the data and make things look nice-ish.
ddply()
To demonstrate graphing and how ggplot2 works more generally, I'll start with a basic graph and build off of it. Creating a graph and adding things to it is kind of analogous to how ggplot2 works.
> library(ggplot2)
> p <- ggplot(jt_avg, aes(x = year, y = avg))
> p + geom_point()
Output:
Three things to introduce here:
The ggplot object, p. Creating a ggplot object involves using the ggplot() function, which is generally used with two arguments: a data.frame and an aes.
aes stands for aesthetic. Aesthetics are used to specify how data should be grouped, colored, drawn, ordered, and/or positioned. In this case, the aes says that the year column in our jt_avg data.frame should be positioned on the x-axis and that the avg column should be positioned on the y-axis.
geom_point(), in the third line, is one of many geoms, or geometries. Geometries are what plot the data. geom_point() is helpful for making scatterplots. There are also geoms for making boxplots (geom_boxplot()), bar charts (geom_bar()), error bars (geom_errorbar()), histograms (geom_histogram()), smoothing curves (geom_smooth()), and lines (geom_line()), among others.
Let's mess around with aesthetics some more and add some colors and shapes so we can differentiate between the two trucks.
> p + geom_point(aes(color = model, shape = model))
Output:
It's also possible to plot this as a bar chart instead of a scatter plot. Generally, geom_bar() uses stat_count(), which means it plots the number of data points in a category instead of plotting their values. In this instance, we want to plot the actual values, so we'll be using stat_identity():
> p + geom_bar(stat = "identity")
Output:
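Newer versions of ggplot2 also provide geom_col(), a bar geom that plots the values as given, so the stat argument isn't needed. A minimal sketch, assuming a ggplot2 version that includes geom_col() and using a hypothetical toy_avg data.frame in place of jt_avg:

```r
library(ggplot2)

# toy_avg is a hypothetical stand-in for jt_avg
toy_avg <- data.frame(year = c(1999, 2008), avg = c(13, 17))
p_toy <- ggplot(toy_avg, aes(x = year, y = avg))

# Both layers plot the avg values themselves rather than counting rows
with_bar <- p_toy + geom_bar(stat = "identity")
with_col <- p_toy + geom_col()
```

I stick with geom_bar(stat = "identity") in the rest of these notes, since that's what I learned first.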
Honestly, that looks pretty terrible, so let's make it look a little worse. Like before, we'll split things up by color. Since we're comparing by models, we'll just mess with their fill:
> p + geom_bar(stat = "identity", aes(fill = model))
Output:
Now you'll notice that the bars are stacked, which is the default for geom_bar() — check the geom_bar() documentation, and you'll see that the default for position is stack. We don't want them stacked, so we'll use a different position. You can find all the different positions available to you by looking for functions in the ggplot2 package that start with position_ (e.g., position_dodge(), position_fill(), position_identity(), position_jitter(), position_jitterdodge(), position_nudge(), and position_stack()). See the documentation for ggplot2 in help.start() for more info on what each does. For this, we'll be using position_dodge() so that the different bars with the same X value are side by side.
> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge")
Output:
We have four bars now, but we had eight data points (our means). Where did the other four go? They blend into the higher values. If we remove one of the values, you'll see that one of the smaller values shows up. I'll modify the dataset to remove the F-150's 1999 average highway fuel economy:
> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity", aes(fill = model), position = "dodge")
Output:
Note that, since I inserted the data into geom_bar and not into the base ggplot layer, represented by p, any other plots we plot will display the unmodified data. This lets you plot multiple data sets on one chart.
> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity", aes(fill = model), position = "dodge") + geom_point()
Output:
I can also add various properties to the geom_point(), as before (e.g., color, shape) and use different positions on it. Here, I'll demonstrate the jitter position and the shapes we used before:
> p + geom_bar(data = jt_avg[jt_avg$avg != 16.2,], stat = "identity", aes(fill = model), position = "dodge") + geom_point(aes(shape = model), position = "jitter")
Output:
Of course, we also could've modified p's dataset like so:
> p <- ggplot(data = jt_avg[jt_avg$avg != 16.2,], aes(x = year, y = avg))
Then, re-running the command above would show that the data point is also missing from the scatter plot:
Output:
We wanted to compare both highway and city fuel efficiency, so we either need to add another aesthetic to help differentiate between those categories, or we can use facets. I'll demonstrate adding another aes first:
> p + geom_bar(stat = "identity", aes(fill = model, color = variable), position = "dodge")
Output:
Remember that variable was our column that had the two factor levels, hwy and cty. The color aesthetic sets the outline color for bar charts, and fill sets the actual color of the bars.
It's also possible to use facets to show two different plots with one plot, grouped by a specified factor:
> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable"))
Output:
You can specify two factors in facet_wrap instead of one if you wish to have a grid of plots. For example, for a 2x2 (number of variables by number of years) chart, we could separate the charts by year:
> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable", "year"))
Output:
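For a strict rows-by-columns layout, ggplot2 also has facet_grid(), which these notes don't otherwise use. A sketch, assuming a hypothetical toy_avg data.frame shaped like jt_avg (the numbers are made up):

```r
library(ggplot2)

# toy_avg is hypothetical: one row per model/year/variable combination
toy_avg <- data.frame(
  model = rep(c("f150", "ram"), times = 4),
  year = rep(c(1999, 1999, 2008, 2008), times = 2),
  variable = rep(c("cty", "hwy"), each = 4),
  avg = c(13, 11, 13, 11.6, 16.2, 15.3, 17, 15.3)
)

grid_plot <- ggplot(toy_avg, aes(x = model, y = avg)) +
  geom_bar(stat = "identity") +
  facet_grid(variable ~ year)   # rows = variable, columns = year
```

The formula reads as rows ~ columns, so swapping the two names transposes the grid.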
We'll build off the chart previous to this one for our example. First off, I don't like charts with colors for print media, so let's make the bars different shades of gray. For that, we'll use scale_fill_grey().
There are scale functions for basically every type of aesthetic (e.g., scales for alpha, color, fill, linetype, shape, size -- see help.start() documentation for all of them and what they do). scale also means something else (x and y axes), so don't confuse those functions with these. Since we're trying to change the colors for the fill, we'll use one of the scale_fill functions:
> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_grey()
Output:
If, for some reason, you don't like those color choices, you may also specify the colors manually. You must specify enough colors for each of your categories. We only have two categories of model (f150 and ram 1500), so I only have to specify two colors:
> p + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(values = c("black", "white"))
Output:
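scale_fill_manual() also accepts a named vector for values, which ties each color to a category by name instead of by position. A sketch with a hypothetical toy_avg data.frame (truck_colors is a name I made up):

```r
library(ggplot2)

# toy_avg is a hypothetical stand-in for jt_avg
toy_avg <- data.frame(
  model = c("f150 pickup 4wd", "ram 1500 pickup 4wd"),
  avg = c(13, 11)
)

# Names must match the values in the model column
truck_colors <- c("f150 pickup 4wd" = "black", "ram 1500 pickup 4wd" = "white")

named_fill <- ggplot(toy_avg, aes(x = model, y = avg)) +
  geom_bar(stat = "identity", aes(fill = model), color = "black") +
  scale_fill_manual(values = truck_colors)
```

With names in place, reordering the vector shouldn't change which truck gets which color.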
Just remember that even when you have pretty charts, things aren't always going to be black and white. I don't particularly like that the facets are labeled "cty" and "hwy," so let's change those to "City" and "Highway". For this, we'll need to redo our base ggplot object so it reflects the changes:
> levels(jt_avg$variable) <- c("City", "Highway")
> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(values = c("black", "white"))
Output:
We can also change the labels in the legend:
> levels(jt_avg$variable) <- c("City", "Highway")
> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white"))
Output:
And we can change the chart and axis titles:
> p <- ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!", x = "Model Year (year)", y = "Fuel Economy (mpg)")
Output:
It'd also be nice to display the years as if they were separate categories instead of numbers on a number line. For that, we need to set the year column as a factor and not a continuous variable:
> jt_avg$year <- as.factor(jt_avg$year)
> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!", x = "Model Year (year)", y = "Fuel Economy (mpg)")
Output:
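To see what as.factor() changes, here's a base R sketch on a made-up vector of years (yrs is hypothetical). The last two lines show a common pitfall when converting a factor back to numbers.

```r
yrs <- c(1999, 2008, 1999, 2008)
yrs_f <- as.factor(yrs)

levels(yrs_f)       # the distinct categories, stored as text
is.numeric(yrs_f)   # no longer numeric, so it's treated as discrete

# as.numeric() on a factor returns the internal level codes, not the years
codes <- as.numeric(yrs_f)

# Going through as.character() first recovers the original years
back <- as.numeric(as.character(yrs_f))
```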
I also want the y-axis to display just even numbers from 0 to 17. We can change the scale to display just that:
> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!", x = "Model Year (year)", y = "Fuel Economy (mpg)") + scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2))
Output:
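The breaks argument is just a numeric vector, so it can be inspected on its own before handing it to scale_y_continuous(). A base R sketch (even_breaks is a name I made up):

```r
# From 0 up to at most 17, in steps of 2
even_breaks <- seq(0, 17, 2)
print(even_breaks)

# The last break is 16 because the next step, 18, would exceed 17
max(even_breaks)
```

One caveat, as far as I understand the ggplot2 documentation: limits set through a scale drop data that fall outside them, so anything that extends past 17 would be removed with a warning. coord_cartesian(ylim = c(0, 17)) is the alternative that zooms without dropping data.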
And finally, I'd like to make sure the white bars (showing the RAM 1500) are pretty much impossible to distinguish from a white background, so we'll use one of the preset themes (e.g., theme_bw(), theme_classic(), theme_dark(), theme_gray(), theme_light(), theme_linedraw(), and theme_minimal()) with a white background to make a chart with a white background:
> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!", x = "Model Year (year)", y = "Fuel Economy (mpg)") + scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2)) + theme_minimal()
Output:
Actually, let's just remove the grid lines so that the white completely blends in:
> ggplot(jt_avg, aes(x = year, y = avg)) + geom_bar(stat = "identity", aes(fill = model), position = "dodge") + facet_wrap(c("variable")) + scale_fill_manual(labels = c("F-150", "RAM 1500"), values = c("black", "white")) + labs(title = "RAM 1500 Sucks!!!1!!1!", x = "Model Year (year)", y = "Fuel Economy (mpg)") + scale_y_continuous(limits = c(0, 17), breaks = seq(0, 17, 2)) + theme_minimal() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Output:
There are lots of other little theme tweaks one can do (e.g., panel.background to change the background color, axis.ticks.x to change how the tick lines are styled).
You may have noticed that this is a lot of code to put on one line. Because ggplot uses addition signs to add things, it's easy to store individual parts of the chart in different variables and then combine them together to break up the code, especially when you have lots of geoms involved (e.g., error bars, regression curves).
Full code (should be copy-pastable):
library(ggplot2)
library(plyr)
library(reshape2)

# Limits vehicles to just trucks
just_trucks <- mpg[mpg$model == "f150 pickup 4wd" | mpg$model == "ram 1500 pickup 4wd",]

# Limits data set to just model, year, and fuel economy
just_trucks <- subset(just_trucks, select = c(model, year, cty, hwy))

# Aggregate the cty and hwy columns into one column
jt_melted <- melt(just_trucks, measure.vars = c("cty", "hwy"))

# Get the mean, median, and standard error for each model/year/fuel economy
# combination
jt_avg <- ddply(jt_melted, c("model", "year", "variable"), summarize,
  avg = mean(value),
  med = median(value),
  se = sqrt(var(value)/length(value)))

# Change names of cty and hwy to City and Highway
levels(jt_avg$variable) <- c("City", "Highway")

# Convert year to a factor so it displays as discrete model years
jt_avg$year <- as.factor(jt_avg$year)

base <- ggplot(jt_avg, aes(x = year, y = avg))

bars <- geom_bar(
  aes(fill = model),
  stat = "identity",
  position = "dodge",
  color = "black") # makes the line around the bars black

errorbars <- geom_errorbar(
  aes(ymax = avg + se, ymin = avg - se, group = model),
  position = position_dodge(width = 1),
  width = 0.25)

with_labels <- base + bars + errorbars +
  facet_wrap(c("variable")) +
  scale_fill_manual(
    labels = c("F-150", "RAM 1500"),
    values = c("black", "white")) +
  labs(title = "RAM 1500 Sucks!!!1!!1!",
    x = "Model Year (year)",
    y = "Fuel Economy (mpg)")

with_even_numbers <- with_labels +
  scale_y_continuous(
    limits = c(0, 17),
    breaks = seq(0, 17, 2))

a_line <- element_line(color = "black", linetype = 1, size = 0.5)

with_theme <- with_even_numbers +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line.x = a_line,
    axis.line.y = a_line,
    axis.ticks.x = a_line,
    axis.ticks.y = a_line,
    axis.ticks.length = unit(5, "pt"),
    legend.position = c(0.15, 0.9),
    legend.direction = "vertical",
    legend.background = element_rect(color = "black"),
    strip.text = element_text(
      family = "Arial",
      face = "bold",
      color = "black",
      size = 14)) +
  guides(fill = guide_legend(title = "Model"))

# Plot it
with_theme
Output:
First things first: R is a Turing-complete programming language. It comes with control structures that come with most other programming languages: if, else, for, and while, amongst others. It also comes with a great help server that can be launched from the interactive console by typing help.start(). The documentation for the base and stats packages is handy as a basic overview of R's vocabulary as a programming language. The search engine can be helpful for finding specific functions. In the context of ggplot2, the documentation for the ggplot2 package is extremely helpful once you know what you're looking at.
The interactive console also provides excellent avenues for accessing help documents. The ?? command is useful for searching for functions and the ? command is useful for reading the documentation. For example, in the ggplot2 package, some functions require a unit object. If I had no idea what a unit object was, I could use ??unit to find all the documentation entries that match unit. It also, helpfully, accepts regular expressions. Then, scrolling down the list of entries, I might see grid::unit, described as a "Function to Create a Unit Object". Typing ?grid::unit would then tell me how to create a unit object.
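The ? and ?? shortcuts also have function-call forms, which can be handier inside scripts. A short sketch of the unit lookup described above (five_points is a name I made up):

```r
# Function-call forms of the shortcuts
help.search("unit")              # same search as ??unit
help("unit", package = "grid")   # same page as ?grid::unit

# Once the help page is found, creating a unit object looks like this
five_points <- grid::unit(5, "pt")
```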
1. Having skimmed the book's table of contents, I can say this sentence analogy is not quite the right way to think about it, but it'll suffice for the purposes of these notes.
2. MS Excel terminology.
3. summarize() is also provided by the plyr package. You can supply any other function for this third argument, but summarize() can already do quite a bit with the help of some other functions.
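As a sketch of what footnote 3 describes, the third argument to ddply() can be any function that takes the data.frame for one group and returns a data.frame. The toy_long data and the spread_of() helper below are hypothetical names made up for illustration, and the snippet assumes the plyr package is installed.

```r
library(plyr)

# toy_long is a hypothetical stand-in for jt_melted
toy_long <- data.frame(
  model = c("f150", "f150", "ram", "ram"),
  value = c(12, 14, 10, 20)
)

# With summarize(), the extra arguments become output columns
by_summarize <- ddply(toy_long, c("model"), summarize, avg = mean(value))

# With a custom function, each group arrives as its own data.frame
spread_of <- function(group) {
  data.frame(lowest = min(group$value), highest = max(group$value))
}
by_custom <- ddply(toy_long, c("model"), spread_of)
```

If dplyr is loaded as well, its own summarize() can mask plyr's, so writing plyr::summarize is the safer spelling in that situation.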